From 518710ca414245ee853dfdcf70ea872b18f27a68 Mon Sep 17 00:00:00 2001 From: Nick Miles Date: Sat, 20 Sep 2025 01:11:41 +0100 Subject: [PATCH 1/2] ENG-893 AWS Backup Restore Testing Validation & Integrity Design --- docs/manual-restore-validation.md | 137 ++++++++ docs/restore-testing-design.md | 312 ++++++++++++++++++ examples/customer-s3-validator/index.ts | 58 ++++ examples/customer-s3-validator/package.json | 16 + examples/customer-s3-validator/tsconfig.json | 14 + examples/manual-validation/main.tf | 83 +++++ .../aws-backup-manual-validation/README.md | 84 +++++ .../dist/orchestrator.js | 106 ++++++ modules/aws-backup-manual-validation/iam.tf | 76 +++++ .../aws-backup-manual-validation/lambda.tf | 45 +++ .../aws-backup-manual-validation/outputs.tf | 4 + .../aws-backup-manual-validation/package.json | 20 ++ .../src/orchestrator.ts | 125 +++++++ .../tsconfig.json | 16 + .../aws-backup-manual-validation/variables.tf | 43 +++ .../aws-backup-manual-validation/versions.tf | 9 + 16 files changed, 1148 insertions(+) create mode 100644 docs/manual-restore-validation.md create mode 100644 docs/restore-testing-design.md create mode 100644 examples/customer-s3-validator/index.ts create mode 100644 examples/customer-s3-validator/package.json create mode 100644 examples/customer-s3-validator/tsconfig.json create mode 100644 examples/manual-validation/main.tf create mode 100644 modules/aws-backup-manual-validation/README.md create mode 100644 modules/aws-backup-manual-validation/dist/orchestrator.js create mode 100644 modules/aws-backup-manual-validation/iam.tf create mode 100644 modules/aws-backup-manual-validation/lambda.tf create mode 100644 modules/aws-backup-manual-validation/outputs.tf create mode 100644 modules/aws-backup-manual-validation/package.json create mode 100644 modules/aws-backup-manual-validation/src/orchestrator.ts create mode 100644 modules/aws-backup-manual-validation/tsconfig.json create mode 100644 modules/aws-backup-manual-validation/variables.tf 
create mode 100644 modules/aws-backup-manual-validation/versions.tf diff --git a/docs/manual-restore-validation.md b/docs/manual-restore-validation.md new file mode 100644 index 0000000..e785bbc --- /dev/null +++ b/docs/manual-restore-validation.md @@ -0,0 +1,137 @@ +# Manual Restore Validation Design + +## 1. Purpose + +Provide a light‑weight, on-demand restore + validation workflow where the **customer supplies their own validation Lambda**. This complements automated restore testing plans by enabling ad-hoc integrity checks (e.g. regression assessment after schema change, pre-cutover rehearsal) without standing orchestration state machines. + +## 2. Overview + +Flow (single resource type per invocation): + +1. Operator (or CI job) invokes Orchestrator Lambda with optional `recoveryPointArn`. +2. Orchestrator chooses recovery point (latest if unspecified) from a backup vault. +3. Starts restore job using AWS Backup `StartRestoreJob`. +4. Polls status (`DescribeRestoreJob`) until terminal state. +5. Invokes customer validator Lambda with contextual payload. +6. Normalises validator response -> calls `PutRestoreValidationResult`. +7. Returns composite result to caller (for CLI / API inspection). + +No Step Functions required for typical short restore + validation cycles; for long running (>15 min) scenarios Step Functions could replace polling. + +## 3. Roles & Responsibilities + +| Component | Responsibility | +|-----------|----------------| +| Orchestrator Lambda | Restore initiation, polling, validator invocation, publishing result | +| Customer Validator Lambda | Domain/resource-specific integrity checks (S3 object presence, record counts, hashes, etc.) | +| AWS Backup | Recovery point catalog & restore execution | +| IAM | Enforces least privilege for restore & validation actions | + +## 4. 
Invocation Payload (Optional Fields) + +```json +{ + "recoveryPointArn": "arn:aws:backup:...:recovery-point:...", // optional override + "expectedKeys": ["path/example1.txt", "path/example2.txt"], // validator-specific + "expectedMinObjects": 10 // optional fallback +} +``` + +## 5. Validator Contract + +Input delivered to customer Lambda (superset of invocation + restore context): + +```json +{ + "restoreJobId": "...", + "recoveryPointArn": "...", + "resourceType": "S3", + "createdResourceArn": "arn:aws:s3:::restored-bucket", + "targetBucket": "restored-bucket", + "s3": { "bucket": "restored-bucket" }, + "expectedKeys": ["..."], + "expectedMinObjects": 10 +} +``` + +Return: + +```json +{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "summary", "details": { } } +``` +Status mapping is case-insensitive; unknown maps to FAILED. + +## 6. Security Considerations + +- Orchestrator policy limited to listing recovery points, starting & describing restore jobs, publishing validation, invoking a single validator ARN. +- Validator policy scoped to specific target bucket ARNs (S3 example). +- Sensitive data avoidance: orchestrator does not log object contents, only metadata. +- Optionally use a dedicated IAM restore role if restore requires cross-service access. + +## 7. Error Handling + +| Scenario | Behaviour | +|----------|-----------| +| No recovery points | Orchestrator throws error (non-validation) | +| Restore timeout | Orchestrator throws after its 55m polling limit, or the invocation ends earlier at the Lambda timeout (15m maximum); FAILED not published | +| Validator throws | Orchestrator records FAILED with parse/message fallback | +| Validator returns malformed JSON | Treated as FAILED with parse error message | + +## 8. Extensibility + +- Add multi-resource batch mode via Step Functions if needed. +- Support additional resource types by adjusting Metadata mapping (e.g. RDS cluster restore specifics). +- Emit custom metrics (future) for restore duration & validator latency. + +## 9. 
Example S3 Validator Patterns + +| Pattern | Description | +|---------|-------------| +| Key existence | Ensure enumerated critical objects are present (manifest-sourced) | +| Non-empty bucket | Basic continuity signal after restore | +| Minimum count | Validate approximate dataset size threshold | +| Sample integrity | (Future) HEAD + ETag comparison against manifest | + +## 10. Terraform Surfaces + +Module `aws-backup-manual-validation` variables: + +- `backup_vault_name` (string, required) +- `validation_lambda_arn` (string, required) +- `resource_type` (string, e.g. S3) +- `target_bucket_name` (string, S3 convenience) +- `name_prefix` (string) + +Outputs: + +- `orchestrator_lambda_arn` + +## 11. Invocation Examples + +AWS CLI (invoke latest recovery point): + +```bash +aws lambda invoke \ + --function-name myproj-dev-manual-restore-orchestrator \ + --payload '{}' out.json && cat out.json | jq +``` + +Explicit recovery point + expected keys: + +```bash +aws lambda invoke \ + --function-name myproj-dev-manual-restore-orchestrator \ + --payload '{"recoveryPointArn":"arn:aws:backup:..","expectedKeys":["manifest.json","data/file1"]}' out.json +``` + +## 12. Limitations + +- Long-running restores may exceed Lambda timeout (convert to Step Functions for scale/time). +- Only single resource restore per invocation. +- No built-in notification channel (user can layer SNS or EventBridge rule on Lambda logs/exits). + +## 13. Future Enhancements + +- Step Functions wrapper for large parallel restores. +- Parameter / Secrets retrieval for RDS validation credentials. +- Config-driven validator selection registry. diff --git a/docs/restore-testing-design.md b/docs/restore-testing-design.md new file mode 100644 index 0000000..50b666a --- /dev/null +++ b/docs/restore-testing-design.md @@ -0,0 +1,312 @@ +# AWS Backup Restore Testing Validation & Integrity Design + +## 1. 
Objectives + +Provide a blueprint extension that not only provisions AWS Backup Restore Testing Plans (already partially implemented via `awscc_backup_restore_testing_plan` and selections) but also validates that restored resources are *functional* and *internally consistent*. Users (blueprint implementers) define integrity checks per resource type (e.g. SQL query for RDS/Aurora, manifest verification for S3, item checks for DynamoDB) executed automatically after AWS Backup restore tests complete. + +## 2. High-Level Architecture + +![end-to-end visual of the event-driven validation workflow](diagrams/restore-validation-sequence.png) + +```text +AWS Backup Restore Testing Plan (scheduled) + │ (runs restore jobs) + ▼ +Restore Test Jobs (Test restore of latest/random recovery points) + │ emit EventBridge events (Restore Job State Change: COMPLETED) + ▼ +EventBridge Rule (filters status=COMPLETED + restoreTestingPlanArn) + │ + ▼ +Step Functions State Machine (or direct Lambda) <── optional batching fan‑in + 1. Fetch restore job details + 2. Dispatch per resource-type validator (Lambda / Fargate / custom) + 3. Execute user-defined integrity logic (SQL / API / S3 diff etc.) + 4. Aggregate results + 5. Call PutRestoreValidationResult (per restore job) + 6. Emit metrics + SNS / EventBridge notifications + │ + ▼ +CloudWatch Metrics / Logs / Alarms + Backup Console Validation Status +``` + +### Why Step Functions? + +- Orchestrates retries, parallel fan-out per restored resource +- Standardises timeout + backoff policies +- Simplifies conditional branching for resource types +- Enables centralised audit trail for validation workflow + +A simpler single Lambda path remains possible for minimal setups; design supports either. + +> For an ad-hoc, customer‑supplied validator workflow (manual restore + external Lambda validation without Step Functions), see `manual-restore-validation.md`. + +## 3. 
Data & Control Flows + +| Flow | Source → Target | Notes | +|------|-----------------|-------| +| A | AWS Backup → EventBridge | "Restore Job State Change" event, includes `restoreJobId`, `resourceType`, `createdResourceArn`, `restoreTestingPlanArn` | +| B | EventBridge → Step Functions | Input filtered by plan ARN / resource types | +| C | Step Functions → AWS Backup API | `DescribeRestoreJob` for enrichment | +| D | Step Functions → Validator Lambdas | One per resource type OR generic dispatcher | +| E | Validators → Target resource | Run integrity checks (SQL, scan, HEAD, etc.) | +| F | Validators → AWS Backup | `PutRestoreValidationResult(ValidationStatus=SUCCESSFUL\|FAILED\|SKIPPED)` | +| G | Step Functions → CloudWatch / SNS | Emit metrics, structured JSON log, optional alert | + +## 4. State Machine Definition (Express or Standard) + +Recommended: **Standard** (because restores may take hours; we only start after COMPLETED but validation might be longer running for large datasets). Express acceptable if you guarantee short validations. 
+ +Proposed states (Amazon States Language pseudo): + +```json +{ + "Comment": "Restore Test Validation Orchestrator", + "StartAt": "Init", + + "States": { + "Init": { "Type": "Pass", "ResultPath": "$.context", "Next": "EnrichRestoreJob" }, + "EnrichRestoreJob": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:describeRestoreJob", "Parameters": { "RestoreJobId.$": "$.detail.restoreJobId" }, "ResultPath": "$.restoreJob", "Next": "RouteByResourceType" }, + "RouteByResourceType": { "Type": "Choice", "Choices": [ + { "Variable": "$.detail.resourceType", "StringEquals": "Aurora", "Next": "AuroraValidation" }, + { "Variable": "$.detail.resourceType", "StringEquals": "RDS", "Next": "RDSValidation" }, + { "Variable": "$.detail.resourceType", "StringEquals": "DynamoDB", "Next": "DynamoValidation" }, + { "Variable": "$.detail.resourceType", "StringEquals": "S3", "Next": "S3Validation" } + ], "Default": "GenericSkip" }, + "AuroraValidation": { "Type": "Task", "Resource": "${lambda_arn_aurora}" , "ResultPath": "$.validation", "Next": "PublishResult" }, + "RDSValidation": { "Type": "Task", "Resource": "${lambda_arn_rds}" , "ResultPath": "$.validation", "Next": "PublishResult" }, + "DynamoValidation": { "Type": "Task", "Resource": "${lambda_arn_dynamo}" , "ResultPath": "$.validation", "Next": "PublishResult" }, + "S3Validation": { "Type": "Task", "Resource": "${lambda_arn_s3}" , "ResultPath": "$.validation", "Next": "PublishResult" }, + "GenericSkip": { "Type": "Pass", "Result": { "status": "SKIPPED", "message": "No validator implemented for resourceType" }, "ResultPath": "$.validation", "Next": "PublishResult" }, + "PublishResult": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:putRestoreValidationResult", "Parameters": { "RestoreJobId.$": "$.detail.restoreJobId", "ValidationStatus.$": "$.validation.status", "ValidationStatusMessage.$": "$.validation.message" }, "ResultPath": null, "Next": "EmitMetrics" }, + "EmitMetrics": { "Type": "Task", "Resource": 
"${lambda_arn_metrics}", "End": true } + } +} +``` + +Notes: + +- `${lambda_arn_*}` produced conditionally via Terraform based on enabled validators. +- Timeout & retry policies applied per Task (e.g. RDS 5 min, S3 2 min, Dynamo 1 min) with `Retry` blocks. +- Could collapse validators into one generic Lambda with plugin pattern. + +## 5. Extensibility Interface + +Users supply validation definitions via Terraform variables consumed by validator Lambda(s). + +### 5.1 Terraform Variables (additions) + +```hcl +variable "restore_validation_config" { + description = "Map keyed by resource type containing validation directives." + type = object({ + rds = optional(object({ + enabled = bool + cluster_identifiers = optional(list(string)) + sql_checks = list(object({ + database = string + statement = string + expected_rows = optional(number) + expected_hash = optional(string) # SHA256 of concatenated row values + timeout_seconds = optional(number) + })) + secret_arn = string # AWS Secrets Manager ARN for master creds or read-only + })) + dynamodb = optional(object({ + enabled = bool + tables = list(string) + checks = list(object({ + table = string + expected_item_count = optional(number) + key_sample = optional(list(object({ + pk = string + sk = optional(string) + expected_item_hash = optional(string) + }))) + })) + })) + s3 = optional(object({ + enabled = bool + buckets = list(object({ + name = string + manifest_s3_uri = optional(string) # points to authoritative manifest + sample_prefixes = optional(list(string)) + compare_object_tags = optional(bool) + })) + })) + aurora = optional(object({ + enabled = bool + clusters = list(string) + sql_checks = list(object({ + cluster_endpoint = optional(string) + database = string + statement = string + expected_rows = optional(number) + })) + secret_arn = string + })) + }) + default = {} +} +``` + + +### 5.2 Lambda Validator Contract + +All validator handlers accept unified event schema: + +```json +{ + "restoreJobId": "string", + 
"resourceType": "RDS|Aurora|DynamoDB|S3|...", + "createdResourceArn": "arn:aws:...", + "config": { "...resource specific config subset..." } +} +``` +Return object: + + +```json +{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "Human readable" } +``` + + +### 5.3 Packaging Strategy + +- Single Lambda with language (Python/Node) loads `config` JSON from SSM Parameter or encrypted file in S3 (to avoid large env variables) +- Pluggable validators registered in a dict keyed by resource type +- Optional user-provided Lambda ARN override per resource type for complete custom logic + +### 5.4 Validation Logic Patterns + +| Resource | Strategy | Failure Conditions | +|----------|----------|-------------------| +| RDS/Aurora | Execute SQL checks (each inside txn, read-only) | Query error, row count mismatch, hash mismatch, timeout | +| DynamoDB | DescribeTable + (optional) Scan limit or PartiQL key gets | Table missing, item count variance > threshold, sample hash mismatch | +| S3 | HEAD sample objects, optional compare against manifest (object key + size + etag) | Missing objects, size/etag mismatch, manifest not accessible | +| EBS (future) | (Optional) Attach test volume to temp instance and run FS metadata probe script | Attach failure, FS errors | + +## 6. Examples + +### 6.1 RDS Example Config + +```hcl +restore_validation_config = { + rds = { + enabled = true + secret_arn = aws_secretsmanager_secret.rds_ro.arn + sql_checks = [ + { database = "appdb", statement = "SELECT COUNT(*) c FROM customers", expected_rows = 1 }, + { database = "appdb", statement = "SELECT sha256(string_agg(id || ':' || status, ',' ORDER BY id)) h FROM orders", expected_hash = "abc123..." 
} + ] + } +} +``` + +### 6.2 DynamoDB Example Config + +```hcl +restore_validation_config = { + dynamodb = { + enabled = true + tables = ["orders", "customers"] + checks = [ + { table = "orders", expected_item_count = 15000 }, + { table = "customers", key_sample = [ { pk = "CUST#123", expected_item_hash = "d41d8cd98f" } ] } + ] + } +} +``` + +### 6.3 S3 Example Config + +```hcl +restore_validation_config = { + s3 = { + enabled = true + buckets = [{ + name = "images-bucket", + manifest_s3_uri = "s3://manifests-prod/images-bucket.manifest.json", + sample_prefixes = ["2025/09/", "2025/08/"] + }] + } +} +``` + +## 7. Security & Compliance + +- IAM: Validators assume dedicated role with least-privilege policies (RDS: `rds-data:ExecuteStatement` / `secretsmanager:GetSecretValue`; DynamoDB: `DescribeTable`, `GetItem`, limited `Scan` with `Limit`; S3: `s3:GetObject` for sample objects and the manifest, which also authorises `HeadObject` requests) +- Secrets: Use Secrets Manager for DB creds; do not log credentials or query data +- KMS: Encrypt Lambda environment variables, S3 manifest bucket, and Secrets Manager secret +- Network: For RDS/Aurora in private subnets, place Lambda in same VPC subnets with least required SG egress +- Auditing: Structured JSON logs (include `restoreJobId`, `resourceType`, check identifiers) +- PII Minimisation: Hash or count only; avoid selecting raw personal data rows +- Integrity of config: Optionally sign config file (S3 object with checksum validation before use) + +## 8. Operational Considerations & Cost + +- Throttle: Concurrency controls via Step Functions + reserved concurrency on validator Lambda to avoid storm after bulk restores +- Timeouts: Short per-check timeouts (e.g. 
30s; fail fast pattern) +- Retention Window: If deeper validation requires longer retention, expose `retain_hours_before_cleanup` variable (aligns with AWS restore testing retention concept) +- Metrics: Emit CloudWatch custom metrics: `ValidationSuccess`, `ValidationFailure`, `ValidationDurationMs` with dimensions `ResourceType`, `PlanName` +- Alerting: SNS topic for failures >0 in last run, or error rate > threshold across rolling period +- Cost Levers: Limit number of SQL checks; use targeted `GetItem` vs full table scans; sample S3 objects (k=20 per prefix) unless manifest diff required + +## 9. Acceptance Criteria Mapping + +| Requirement | Design Element | +|------------|----------------| +| "Ability from the blueprint to run automated test to validate restoration" | EventBridge + Step Functions + validators triggered on restore completion | +| "Test integrity of restored resource, specific to blueprint implementer" | `restore_validation_config` + per-resource plugin architecture | +| "Define an SQL query for RDS to test integrity" | `sql_checks` array with expected rows/hash support | +| "Customer responsible for defining and validating check" | User supplies Terraform variable config and (optionally) custom Lambda override | +| "Step function would just allow this functionality" | State machine orchestrates and records results via `PutRestoreValidationResult` | + +## 10. Future Enhancements + +- Add cross-account validation (restore to isolated test account, assume role back) +- Support FSx / EFS mount probing using Fargate task +- Provide Terraform module subfolder `validation` generating Step Functions + default validator Lambda +- Add canned dashboards (CloudWatch) for validation pass rate & duration + +## 11. Terraform Module Additions (Summary) + +Minimal initial scope: + +1. New optional module `aws-backup-validation` OR integrated into `aws-backup-source` behind feature flag `enable_restore_validation` +2. 
Resources: + - EventBridge rule + - Step Functions state machine (JSON from templatefile) + - IAM roles/policies (state machine + lambda) + - Validator Lambda (zip from local build or external source) + - SSM Parameter / S3 object for config JSON +3. Variables: `enable_restore_validation`, `restore_validation_config`, `custom_validator_lambda_arns` (map) +4. Outputs: `restore_validation_state_machine_arn`, `restore_validation_config_parameter_arn` + +Current prototype implementation lives in `modules/aws-backup-validation` and provides a minimal Lambda + Step Functions + EventBridge rule path. Future iterations should harden IAM scoping and expand validator logic prior to production adoption. + +## 12. Example User Flow + +1. Enable restore testing (already done with existing plan resources) +2. Set `enable_restore_validation = true` +3. Provide `restore_validation_config` with at least one resource type +4. Apply Terraform – deploys validation infra +5. Wait for scheduled restore test; Step Functions records validation results +6. View status in AWS Backup Console / CloudWatch dashboard + +## 13. Risks & Mitigations + +| Risk | Mitigation | +|------|------------| +| Long-running SQL leads to Lambda timeout | Enforce per-query timeout + limit operations (SELECT only) | +| Validator failure blocks result publishing | Wrap each validator in try/catch; on unhandled exception mark FAILED with reason | +| Sensitive data leakage in logs | Scrub query parameters and row data; log only counts + hashes | +| Drift between Terraform config and live validator config | Version config (include checksum) and log version per run | +| Excess costs from scanning large DynamoDB tables | Use item count from `DescribeTable` and targeted sample keys, avoid full scans | + +## 14. Open Questions + +- Provide managed library of validation query templates? (Out of initial scope) +- Should retention hours be explicitly configurable per selection via Terraform? 
(Potential future variable) +- Add option for concurrency-limited validation queue (SQS + Lambda) instead of Step Functions? (Future scale consideration) + diff --git a/examples/customer-s3-validator/index.ts b/examples/customer-s3-validator/index.ts new file mode 100644 index 0000000..9192687 --- /dev/null +++ b/examples/customer-s3-validator/index.ts @@ -0,0 +1,58 @@ +import { S3Client, HeadObjectCommand, ListObjectsV2Command } from "@aws-sdk/client-s3"; + +const s3 = new S3Client({}); + +/* Example validator strategy: + 1. If event.expectedKeys provided -> verify each exists. + 2. Else if event.s3.bucket provided -> ensure bucket contains at least one object (or expectedMinObjects). + Return status + message summarising findings. +*/ + +interface EventShape { + restoreJobId: string; + recoveryPointArn: string; + resourceType: string; + createdResourceArn?: string; + targetBucket?: string; + s3?: { bucket?: string }; + expectedKeys?: string[]; + expectedMinObjects?: number; +} + +export const handler = async (event: EventShape) => { + const bucket = event.targetBucket || event.s3?.bucket; + if (!bucket) { + return { status: "SKIPPED", message: "No bucket specified" }; + } + + if (event.expectedKeys && event.expectedKeys.length > 0) { + const missing: string[] = []; + for (const key of event.expectedKeys) { + try { + await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key })); + } catch (e) { + missing.push(key); + } + } + if (missing.length > 0) { + return { status: "FAILED", message: `Missing ${missing.length} objects`, missing }; + } + return { status: "SUCCESSFUL", message: `All ${event.expectedKeys.length} expected objects present` }; + } + + // Fallback: simple non-empty check or min object threshold + const min = event.expectedMinObjects ?? 
1; + let found = 0; + let ContinuationToken: string | undefined = undefined; + while (found < min) { + const resp = await s3.send(new ListObjectsV2Command({ Bucket: bucket, MaxKeys: 1000, ContinuationToken })); + const count = resp.Contents?.length || 0; + found += count; + if (!resp.IsTruncated) break; + ContinuationToken = resp.NextContinuationToken; + } + if (found < min) { + return { status: "FAILED", message: `Only ${found} objects found (< ${min})` }; + } + return { status: "SUCCESSFUL", message: `Found ${found} objects (>= ${min})` }; +}; diff --git a/examples/customer-s3-validator/package.json b/examples/customer-s3-validator/package.json new file mode 100644 index 0000000..0761b10 --- /dev/null +++ b/examples/customer-s3-validator/package.json @@ -0,0 +1,16 @@ +{ + "name": "customer-s3-validator-example", + "version": "0.1.0", + "private": true, + "type": "module", + "scripts": { + "build": "tsc -p tsconfig.json" + }, + "dependencies": { + "@aws-sdk/client-s3": "^3.637.0" + }, + "devDependencies": { + "typescript": "^5.4.0", + "@types/node": "^20.11.0" + } +} diff --git a/examples/customer-s3-validator/tsconfig.json b/examples/customer-s3-validator/tsconfig.json new file mode 100644 index 0000000..8cb4dd2 --- /dev/null +++ b/examples/customer-s3-validator/tsconfig.json @@ -0,0 +1,14 @@ +{ + "compilerOptions": { + "target": "ES2020", + "module": "ES2020", + "moduleResolution": "Node", + "outDir": "dist", + "rootDir": ".", + "esModuleInterop": true, + "strict": true, + "skipLibCheck": true + }, + "include": ["index.ts"], + "exclude": ["node_modules"] +} diff --git a/examples/manual-validation/main.tf b/examples/manual-validation/main.tf new file mode 100644 index 0000000..5210885 --- /dev/null +++ b/examples/manual-validation/main.tf @@ -0,0 +1,83 @@ +terraform { + required_version = ">= 1.5.0" + required_providers { + aws = { + source = "hashicorp/aws" + version = ">= 5.0" + } + } +} + +provider "aws" { + region = var.region +} + +variable "region" { type = 
string } +variable "name_prefix" { type = string } +variable "backup_vault_name" { type = string } +variable "restore_bucket" { type = string } + +# Example customer validator lambda (upload dist bundle manually or integrate build pipeline). +resource "aws_lambda_function" "customer_validator" { + function_name = "${var.name_prefix}-customer-s3-validator" + role = aws_iam_role.customer_validator.arn + handler = "index.handler" + runtime = "nodejs20.x" + filename = "./lambda_customer_validator.zip" # user supplied artifact + source_code_hash = filebase64sha256("./lambda_customer_validator.zip") + timeout = 60 + environment { + variables = {} + } +} + +resource "aws_iam_role" "customer_validator" { + name = "${var.name_prefix}-customer-s3-validator-role" + assume_role_policy = data.aws_iam_policy_document.lambda_assume.json +} + +data "aws_iam_policy_document" "lambda_assume" { + statement { + actions = ["sts:AssumeRole"] + principals { type = "Service" identifiers = ["lambda.amazonaws.com"] } + } +} + +resource "aws_iam_role_policy_attachment" "logs_attach_customer" { + role = aws_iam_role.customer_validator.name + policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole" +} + +resource "aws_iam_policy" "customer_s3_policy" { + name = "${var.name_prefix}-customer-s3-validator-policy" + policy = jsonencode({ + Version = "2012-10-17" + Statement = [ + { + Effect = "Allow" + Action = ["s3:ListBucket", "s3:GetObject"] + Resource = [ + "arn:aws:s3:::${var.restore_bucket}", + "arn:aws:s3:::${var.restore_bucket}/*" + ] + } + ] + }) +} + +resource "aws_iam_role_policy_attachment" "customer_validator_attach" { + role = aws_iam_role.customer_validator.name + policy_arn = aws_iam_policy.customer_s3_policy.arn +} + +module "manual_validation" { + source = "../../modules/aws-backup-manual-validation" + enable = true + name_prefix = var.name_prefix + backup_vault_name = var.backup_vault_name + resource_type = "S3" + validation_lambda_arn = 
aws_lambda_function.customer_validator.arn + target_bucket_name = var.restore_bucket +} + +output "orchestrator_lambda" { value = module.manual_validation.orchestrator_lambda_arn } diff --git a/modules/aws-backup-manual-validation/README.md b/modules/aws-backup-manual-validation/README.md new file mode 100644 index 0000000..4ec158c --- /dev/null +++ b/modules/aws-backup-manual-validation/README.md @@ -0,0 +1,84 @@ +# AWS Backup Manual Restore Validation Module + +Provides an on-demand Lambda **orchestrator** that: + +1. Selects a recovery point (latest by default) from a specified backup vault. +2. Starts a restore job for the chosen recovery point (supports S3 in example). +3. Waits for restore job completion (polling AWS Backup). +4. Invokes a **customer-provided validation Lambda** (you own resource-specific logic). +5. Publishes validation status back to AWS Backup using `PutRestoreValidationResult`. + +This pattern differs from automated restore testing plans: it is **manually triggered** (e.g. via `aws lambda invoke` or an API Gateway front-end) and delegates validation logic entirely to a customer-maintained Lambda. + +## Key Design Principles + +- **Separation of concerns**: Orchestrator handles restore lifecycle & result publishing; customer Lambda handles semantic integrity checks. +- **Pluggable**: Any runtime or language for validator (only contract is JSON in/out). +- **Minimal surface**: No Step Functions required for single-resource manual validation. + +## Orchestrator Environment Variables + +| Variable | Purpose | +|----------|---------| +| `BACKUP_VAULT_NAME` | Source vault to enumerate recovery points | +| `RESOURCE_TYPE` | Backup resource type (e.g. 
`S3`) | +| `VALIDATOR_LAMBDA` | ARN of customer validator Lambda | +| `TARGET_BUCKET` | (S3 only) Destination bucket name to validate | +| `RESTORE_ROLE_ARN` | (Optional) IAM role used for restore job | + +## Customer Validator Contract + +**Invocation Payload** (example): + +```json +{ + "restoreJobId": "1234abcd", + "recoveryPointArn": "arn:aws:backup:...:recovery-point:...", + "resourceType": "S3", + "createdResourceArn": "arn:aws:s3:::restored-bucket", + "targetBucket": "restored-bucket", + "s3": { "bucket": "restored-bucket" } +} +``` + +**Return Object**: + +```json +{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "Human readable summary" } +``` +Statuses are normalised by the orchestrator before calling `PutRestoreValidationResult`. + +## Terraform Inputs + +See `variables.tf` for full list. Essential: + +```hcl +module "manual_validation" { + source = "../modules/aws-backup-manual-validation" + enable = true + name_prefix = var.name_prefix + backup_vault_name = var.backup_vault_name + resource_type = "S3" + validation_lambda_arn = aws_lambda_function.customer_validator.arn + target_bucket_name = var.target_restore_bucket +} +``` + +## Example Validator (S3 Presence / Count) + +See `../../examples/customer-s3-validator` for a full TypeScript implementation scanning a set of expected keys or listing a prefix to ensure non-empty restore. + +## Operational Notes + +- Timeouts: Orchestrator Lambda default timeout is 15 minutes; long restores will exceed this—use small test datasets or adapt to Step Functions if needed. +- Costs: Avoid listing millions of S3 keys in the validator; prefer sampling. +- IAM Hardening: Current policy uses broad `backup:*` subset and `s3:Get*`; tighten to specific ARNs in production. + +## Future Enhancements + +- Option to specify explicit recovery point instead of auto-pick (supported already via event.recoveryPointArn field). +- Emit custom CloudWatch metrics for validation duration & success rate. 
+- Optional SNS notification on failure. + +--- +MIT style licensing per repository policy. diff --git a/modules/aws-backup-manual-validation/dist/orchestrator.js b/modules/aws-backup-manual-validation/dist/orchestrator.js new file mode 100644 index 0000000..9c19246 --- /dev/null +++ b/modules/aws-backup-manual-validation/dist/orchestrator.js @@ -0,0 +1,106 @@ +import { BackupClient, ListRecoveryPointsByBackupVaultCommand, StartRestoreJobCommand, DescribeRestoreJobCommand, PutRestoreValidationResultCommand } from "@aws-sdk/client-backup"; +import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda"; +import { S3Client } from "@aws-sdk/client-s3"; +const backup = new BackupClient({}); +const lambda = new LambdaClient({}); +const s3 = new S3Client({}); +const BACKUP_VAULT_NAME = process.env.BACKUP_VAULT_NAME; +const RESOURCE_TYPE = process.env.RESOURCE_TYPE; // e.g. S3 +const VALIDATOR_LAMBDA = process.env.VALIDATOR_LAMBDA; +const TARGET_BUCKET = process.env.TARGET_BUCKET; // optional S3 bucket +export const handler = async (event = {}) => { + console.log(JSON.stringify({ msg: "Manual restore orchestration start", event })); + const recoveryPointArn = event.recoveryPointArn || await pickLatestRecoveryPoint(); + console.log({ recoveryPointArn }); + const restoreJobId = await startRestore(recoveryPointArn); + console.log({ restoreJobId }); + const restoreDesc = await waitForCompletion(restoreJobId); + console.log({ restoreDesc }); + const validatorPayload = { + restoreJobId, + recoveryPointArn, + resourceType: RESOURCE_TYPE, + createdResourceArn: restoreDesc.CreatedResourceArn, + targetBucket: TARGET_BUCKET, + s3: { bucket: TARGET_BUCKET } + }; + const validationResult = await invokeValidator(validatorPayload); + console.log({ validationResult }); + await publishValidation(restoreJobId, validationResult); + return { + restoreJobId, + recoveryPointArn, + validation: validationResult + }; +}; +async function pickLatestRecoveryPoint() { + const cmd = new 
ListRecoveryPointsByBackupVaultCommand({ BackupVaultName: BACKUP_VAULT_NAME, MaxResults: 20 }); + const resp = await backup.send(cmd); + if (!resp.RecoveryPoints || resp.RecoveryPoints.length === 0) { + throw new Error("No recovery points found in vault"); + } + const sorted = [...resp.RecoveryPoints].sort((a, b) => (b.CreationDate?.getTime() || 0) - (a.CreationDate?.getTime() || 0)); + return sorted[0].RecoveryPointArn; +} +async function startRestore(recoveryPointArn) { + const cmd = new StartRestoreJobCommand({ + RecoveryPointArn: recoveryPointArn, + IamRoleArn: process.env.RESTORE_ROLE_ARN, + ResourceType: RESOURCE_TYPE, + Metadata: TARGET_BUCKET ? { destinationBucketName: TARGET_BUCKET } : {} + }); + const resp = await backup.send(cmd); + if (!resp.RestoreJobId) + throw new Error("StartRestoreJob returned no RestoreJobId"); + return resp.RestoreJobId; +} +async function waitForCompletion(restoreJobId) { + const timeoutMs = 1000 * 60 * 55; + const start = Date.now(); + while (Date.now() - start < timeoutMs) { + const desc = await backup.send(new DescribeRestoreJobCommand({ RestoreJobId: restoreJobId })); + if (desc.Status === "COMPLETED" || desc.Status === "ABORTED" || desc.Status === "FAILED") { + return desc; + } + await new Promise(r => setTimeout(r, 15000)); + } + throw new Error("Restore job did not finish within timeout"); +} +async function invokeValidator(payload) { + const cmd = new InvokeCommand({ + FunctionName: VALIDATOR_LAMBDA, + InvocationType: "RequestResponse", + Payload: Buffer.from(JSON.stringify(payload)) + }); + const resp = await lambda.send(cmd); + if (!resp.Payload) + throw new Error("Validator returned no payload"); + const txt = Buffer.from(resp.Payload).toString("utf-8"); + try { + return JSON.parse(txt); + } catch (e) { + throw new Error("Validator payload JSON parse error: " + txt); + } +} +async function publishValidation(restoreJobId, result) { + const status = mapStatus(result.status); + const message = (result.message || 
"").slice(0, 1000); + const cmd = new PutRestoreValidationResultCommand({ + RestoreJobId: restoreJobId, + ValidationStatus: status, + ValidationStatusMessage: message + }); + await backup.send(cmd); +} +function mapStatus(s) { + if (!s) + return "FAILED"; + const upper = s.toUpperCase(); + if (["SUCCESS", "SUCCESSFUL", "OK"].includes(upper)) + return "SUCCESSFUL"; + if (["FAILED", "FAIL", "ERROR"].includes(upper)) + return "FAILED"; + if (["SKIPPED", "IGNORE", "IGNORED"].includes(upper)) + return "SKIPPED"; + return "FAILED"; +} diff --git a/modules/aws-backup-manual-validation/iam.tf b/modules/aws-backup-manual-validation/iam.tf new file mode 100644 index 0000000..1601dc3 --- /dev/null +++ b/modules/aws-backup-manual-validation/iam.tf @@ -0,0 +1,76 @@ +locals { + manual_validation_name = "${var.name_prefix}-manual-restore-validation" +} + +resource "aws_iam_role" "orchestrator" { + count = var.enable ? 1 : 0 + name = "${local.manual_validation_name}-orchestrator" + assume_role_policy = data.aws_iam_policy_document.orchestrator_assume.json +} + +data "aws_iam_policy_document" "orchestrator_assume" { + statement { + actions = ["sts:AssumeRole"] + principals { + type = "Service" + identifiers = ["lambda.amazonaws.com"] + } + } +} + +# NOTE: Permissions are intentionally broad placeholders; should be tightened. +# Includes: listing recovery points, starting restore job, describing restore job, +# invoking customer validation Lambda, writing logs, optional S3 read. 
+ +data "aws_iam_policy_document" "orchestrator" { + statement { + sid = "Logs" + actions = [ + "logs:CreateLogGroup", + "logs:CreateLogStream", + "logs:PutLogEvents" + ] + resources = ["*"] + } + + statement { + sid = "BackupCore" + actions = [ + "backup:ListRecoveryPointsByBackupVault", + "backup:StartRestoreJob", + "backup:DescribeRestoreJob", + "backup:PutRestoreValidationResult" + ] + resources = ["*"] + } + + statement { + sid = "InvokeValidator" + actions = [ + "lambda:InvokeFunction" + ] + resources = [var.validation_lambda_arn] + } + + statement { + sid = "S3ReadOptional" + actions = [ + "s3:ListBucket", + "s3:GetObject", + # HeadObject calls are authorised by s3:GetObject; "s3:HeadObject" is not an IAM action + ] + resources = ["*"] + } +} + +resource "aws_iam_policy" "orchestrator" { + count = var.enable ? 1 : 0 + name = "${local.manual_validation_name}-policy" + policy = data.aws_iam_policy_document.orchestrator.json +} + +resource "aws_iam_role_policy_attachment" "orchestrator" { + count = var.enable ? 1 : 0 + role = aws_iam_role.orchestrator[0].name + policy_arn = aws_iam_policy.orchestrator[0].arn +} diff --git a/modules/aws-backup-manual-validation/lambda.tf b/modules/aws-backup-manual-validation/lambda.tf new file mode 100644 index 0000000..58d6de6 --- /dev/null +++ b/modules/aws-backup-manual-validation/lambda.tf @@ -0,0 +1,45 @@ +locals { + orchestrator_src_dir = "${path.module}/src" +} + +resource "aws_cloudwatch_log_group" "orchestrator" { + count = var.enable ? 1 : 0 + name = "/aws/lambda/${aws_lambda_function.orchestrator[0].function_name}" + retention_in_days = 30 +} + +# We keep a pre-built JS file for simplicity; user can rebuild if modifying. +# (If a build step is desired, integrate external build pipeline.) + +data "archive_file" "orchestrator" { + type = "zip" + source_file = "${path.module}/dist/orchestrator.js" + output_path = "${path.module}/dist/orchestrator.zip" +} + +resource "aws_lambda_function" "orchestrator" { + count = var.enable ? 
1 : 0 + function_name = "${var.name_prefix}-manual-restore-orchestrator" + role = aws_iam_role.orchestrator[0].arn + handler = "orchestrator.handler" + runtime = "nodejs20.x" + filename = data.archive_file.orchestrator.output_path + source_code_hash = data.archive_file.orchestrator.output_base64sha256 + timeout = 900 + memory_size = 256 + + environment { + variables = { + BACKUP_VAULT_NAME = var.backup_vault_name + RESOURCE_TYPE = var.resource_type + VALIDATOR_LAMBDA = var.validation_lambda_arn + TARGET_BUCKET = var.target_bucket_name + } + } + tags = var.tags +} + +output "manual_restore_orchestrator_lambda_arn" { + value = try(aws_lambda_function.orchestrator[0].arn, null) + description = "ARN of the manual restore orchestrator lambda" +} diff --git a/modules/aws-backup-manual-validation/outputs.tf b/modules/aws-backup-manual-validation/outputs.tf new file mode 100644 index 0000000..6cc5e34 --- /dev/null +++ b/modules/aws-backup-manual-validation/outputs.tf @@ -0,0 +1,4 @@ +output "orchestrator_lambda_arn" { + value = try(aws_lambda_function.orchestrator[0].arn, null) + description = "Manual restore validation orchestrator Lambda ARN" +} diff --git a/modules/aws-backup-manual-validation/package.json b/modules/aws-backup-manual-validation/package.json new file mode 100644 index 0000000..fc5555b --- /dev/null +++ b/modules/aws-backup-manual-validation/package.json @@ -0,0 +1,20 @@ +{ + "name": "aws-backup-manual-validation-orchestrator", + "version": "0.1.0", + "private": true, + "type": "module", + "scripts": { + "build": "tsc --project tsconfig.json", + "clean": "rimraf dist" + }, + "dependencies": { + "@aws-sdk/client-backup": "^3.637.0", + "@aws-sdk/client-lambda": "^3.637.0", + "@aws-sdk/client-s3": "^3.637.0" + }, + "devDependencies": { + "typescript": "^5.4.0", + "@types/node": "^20.11.0", + "rimraf": "^5.0.5" + } +} diff --git a/modules/aws-backup-manual-validation/src/orchestrator.ts b/modules/aws-backup-manual-validation/src/orchestrator.ts new file mode 
100644 index 0000000..eda153e --- /dev/null +++ b/modules/aws-backup-manual-validation/src/orchestrator.ts @@ -0,0 +1,125 @@ +/* Orchestrator Lambda (TypeScript) + Triggers a manual restore job for a chosen recovery point and invokes a customer-provided validation Lambda. + The customer Lambda should return JSON: { status: "SUCCESSFUL|FAILED|SKIPPED", message: string } +*/ +import { BackupClient, ListRecoveryPointsByBackupVaultCommand, StartRestoreJobCommand, DescribeRestoreJobCommand, PutRestoreValidationResultCommand } from "@aws-sdk/client-backup"; +import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda"; +import { S3Client, HeadObjectCommand } from "@aws-sdk/client-s3"; + +const backup = new BackupClient({}); +const lambda = new LambdaClient({}); +const s3 = new S3Client({}); + +const BACKUP_VAULT_NAME = process.env.BACKUP_VAULT_NAME!; +const RESOURCE_TYPE = process.env.RESOURCE_TYPE!; // e.g. S3 +const VALIDATOR_LAMBDA = process.env.VALIDATOR_LAMBDA!; +const TARGET_BUCKET = process.env.TARGET_BUCKET; // optional S3 bucket + +interface ValidatorResult { status: string; message?: string; [k: string]: any } + +export const handler = async (event: any = {}): Promise<any> => { + console.log(JSON.stringify({ msg: "Manual restore orchestration start", event })); + + const recoveryPointArn = event.recoveryPointArn || await pickLatestRecoveryPoint(); + console.log({ recoveryPointArn }); + + const restoreJobId = await startRestore(recoveryPointArn); + console.log({ restoreJobId }); + + const restoreDesc = await waitForCompletion(restoreJobId); + console.log({ restoreDesc }); + + const validatorPayload = { + restoreJobId, + recoveryPointArn, + resourceType: RESOURCE_TYPE, + createdResourceArn: restoreDesc.CreatedResourceArn, + targetBucket: TARGET_BUCKET, + // Additional S3 example context the customer validator might use: + s3: { bucket: TARGET_BUCKET } + }; + + const validationResult = await invokeValidator(validatorPayload); + console.log({ validationResult 
}); + + await publishValidation(restoreJobId, validationResult); + + return { + restoreJobId, + recoveryPointArn, + validation: validationResult + }; +}; + +async function pickLatestRecoveryPoint(): Promise<string> { + const cmd = new ListRecoveryPointsByBackupVaultCommand({ BackupVaultName: BACKUP_VAULT_NAME, MaxResults: 20 }); + const resp = await backup.send(cmd); + if (!resp.RecoveryPoints || resp.RecoveryPoints.length === 0) { + throw new Error("No recovery points found in vault"); + } + // Sort by CreationDate descending + const sorted = [...resp.RecoveryPoints].sort((a, b) => (b.CreationDate?.getTime() || 0) - (a.CreationDate?.getTime() || 0)); + return sorted[0].RecoveryPointArn!; +} + +async function startRestore(recoveryPointArn: string): Promise<string> { + // For S3 we can do a metadata-only restore or specify a placeholder + const cmd = new StartRestoreJobCommand({ + RecoveryPointArn: recoveryPointArn, + IamRoleArn: process.env.RESTORE_ROLE_ARN, + ResourceType: RESOURCE_TYPE, + Metadata: TARGET_BUCKET ? 
{ destinationBucketName: TARGET_BUCKET } : {} + }); + const resp = await backup.send(cmd); + if (!resp.RestoreJobId) throw new Error("StartRestoreJob returned no RestoreJobId"); + return resp.RestoreJobId; +} + +async function waitForCompletion(restoreJobId: string) { + const timeoutMs = 1000 * 60 * 55; // 55-minute cap; note the 900 s Lambda timeout in lambda.tf ends the invocation first + const start = Date.now(); + while (Date.now() - start < timeoutMs) { + const desc = await backup.send(new DescribeRestoreJobCommand({ RestoreJobId: restoreJobId })); + if (desc.Status === "COMPLETED" || desc.Status === "ABORTED" || desc.Status === "FAILED") { + return desc; + } + await new Promise(r => setTimeout(r, 15000)); + } + throw new Error("Restore job did not finish within timeout"); +} + +async function invokeValidator(payload: any): Promise<ValidatorResult> { + const cmd = new InvokeCommand({ + FunctionName: VALIDATOR_LAMBDA, + InvocationType: "RequestResponse", + Payload: Buffer.from(JSON.stringify(payload)) + }); + const resp = await lambda.send(cmd); + if (!resp.Payload) throw new Error("Validator returned no payload"); + const txt = Buffer.from(resp.Payload).toString("utf-8"); + try { + return JSON.parse(txt); + } catch (e) { + throw new Error("Validator payload JSON parse error: " + txt); + } +} + +async function publishValidation(restoreJobId: string, result: ValidatorResult) { + const status = mapStatus(result.status); + const message = (result.message || "").slice(0, 1000); + const cmd = new PutRestoreValidationResultCommand({ + RestoreJobId: restoreJobId, + ValidationStatus: status, + ValidationStatusMessage: message + }); + await backup.send(cmd); +} + +function mapStatus(s?: string): string { + if (!s) return "FAILED"; + const upper = s.toUpperCase(); + if (["SUCCESS", "SUCCESSFUL", "OK"].includes(upper)) return "SUCCESSFUL"; + if (["FAILED", "FAIL", "ERROR"].includes(upper)) return "FAILED"; + if (["SKIPPED", "IGNORE", "IGNORED"].includes(upper)) return "SKIPPED"; + return "FAILED"; +} diff --git 
a/modules/aws-backup-manual-validation/tsconfig.json b/modules/aws-backup-manual-validation/tsconfig.json new file mode 100644 index 0000000..01dcc7f --- /dev/null +++ b/modules/aws-backup-manual-validation/tsconfig.json @@ -0,0 +1,16 @@ +{ + "compilerOptions": { + "target": "ES2020", + "module": "ES2020", + "moduleResolution": "Node", + "outDir": "dist", + "rootDir": "src", + "esModuleInterop": true, + "forceConsistentCasingInFileNames": true, + "strict": true, + "skipLibCheck": true, + "resolveJsonModule": true + }, + "include": ["src/**/*.ts"], + "exclude": ["node_modules"] +} diff --git a/modules/aws-backup-manual-validation/variables.tf b/modules/aws-backup-manual-validation/variables.tf new file mode 100644 index 0000000..4a18a79 --- /dev/null +++ b/modules/aws-backup-manual-validation/variables.tf @@ -0,0 +1,43 @@ +variable "enable" { + type = bool + default = true + description = "Whether to create manual validation orchestration resources." +} + +variable "name_prefix" { + type = string + description = "Prefix used for naming resources (e.g. project-env)." +} + +variable "backup_vault_name" { + type = string + description = "Name of the backup vault containing recovery points to restore for manual tests." +} + +variable "restore_role_arn" { + type = string + description = "IAM role ARN used by the restore job if a specific role is required (optional)." + default = null +} + +variable "validation_lambda_arn" { + type = string + description = "Customer-provided Lambda ARN that performs validation after manual restore completes." +} + +variable "resource_type" { + type = string + description = "AWS Backup resource type for manual restore (e.g. S3, DynamoDB, RDS)." +} + +variable "target_bucket_name" { + type = string + description = "For S3 restores: name of the destination S3 bucket that the restore will produce or populate. Used only in the example orchestrator logic." 
+ default = null +} + +variable "tags" { + type = map(string) + default = {} + description = "Tags to apply to created resources." +} diff --git a/modules/aws-backup-manual-validation/versions.tf b/modules/aws-backup-manual-validation/versions.tf new file mode 100644 index 0000000..7f163ea --- /dev/null +++ b/modules/aws-backup-manual-validation/versions.tf @@ -0,0 +1,9 @@ +terraform { + required_version = ">= 1.5.0" + required_providers { + aws = { + source = "hashicorp/aws" + version = ">= 5.0" + } + } +} From c81b0fc8e4ab0c7654bfe8c554a1948431bd018d Mon Sep 17 00:00:00 2001 From: Nick Miles Date: Sat, 20 Sep 2025 01:19:58 +0100 Subject: [PATCH 2/2] ENG-893 Remove accidental inclusion --- docs/restore-testing-design.md | 312 --------------------------------- 1 file changed, 312 deletions(-) delete mode 100644 docs/restore-testing-design.md diff --git a/docs/restore-testing-design.md b/docs/restore-testing-design.md deleted file mode 100644 index 50b666a..0000000 --- a/docs/restore-testing-design.md +++ /dev/null @@ -1,312 +0,0 @@ -# AWS Backup Restore Testing Validation & Integrity Design - -## 1. Objectives - -Provide a blueprint extension that not only provisions AWS Backup Restore Testing Plans (already partially implemented via `awscc_backup_restore_testing_plan` and selections) but also validates that restored resources are *functional* and *internally consistent*. Users (blueprint implementers) define integrity checks per resource type (e.g. SQL query for RDS/Aurora, manifest verification for S3, item checks for DynamoDB) executed automatically after AWS Backup restore tests complete. - -## 2. 
High-Level Architecture - -![end-to-end visual of the event-driven validation workflow](diagrams/restore-validation-sequence.png) - -```text -AWS Backup Restore Testing Plan (scheduled) - │ (runs restore jobs) - ▼ -Restore Test Jobs (Test restore of latest/random recovery points) - │ emit EventBridge events (Restore Job State Change: COMPLETED) - ▼ -EventBridge Rule (filters status=COMPLETED + restoreTestingPlanArn) - │ - ▼ -Step Functions State Machine (or direct Lambda) <── optional batching fan‑in - 1. Fetch restore job details - 2. Dispatch per resource-type validator (Lambda / Fargate / custom) - 3. Execute user-defined integrity logic (SQL / API / S3 diff etc.) - 4. Aggregate results - 5. Call PutRestoreValidationResult (per restore job) - 6. Emit metrics + SNS / EventBridge notifications - │ - ▼ -CloudWatch Metrics / Logs / Alarms + Backup Console Validation Status -``` - -### Why Step Functions? - -- Orchestrates retries, parallel fan-out per restored resource -- Standardises timeout + backoff policies -- Simplifies conditional branching for resource types -- Enables centralised audit trail for validation workflow - -A simpler single Lambda path remains possible for minimal setups; design supports either. - -> For an ad-hoc, customer‑supplied validator workflow (manual restore + external Lambda validation without Step Functions), see `manual-restore-validation.md`. - -## 3. Data & Control Flows - -| Flow | Source → Target | Notes | -|------|-----------------|-------| -| A | AWS Backup → EventBridge | "Restore Job State Change" event, includes `restoreJobId`, `resourceType`, `createdResourceArn`, `restoreTestingPlanArn` | -| B | EventBridge → Step Functions | Input filtered by plan ARN / resource types | -| C | Step Functions → AWS Backup API | `DescribeRestoreJob` for enrichment | -| D | Step Functions → Validator Lambdas | One per resource type OR generic dispatcher | -| E | Validators → Target resource | Run integrity checks (SQL, scan, HEAD, etc.) 
| -| F | Validators → AWS Backup | `PutRestoreValidationResult(ValidationStatus=SUCCESSFUL\|FAILED\|SKIPPED)` | -| G | Step Functions → CloudWatch / SNS | Emit metrics, structured JSON log, optional alert | - -## 4. State Machine Definition (Express or Standard) - -Recommended: **Standard** (because restores may take hours; we only start after COMPLETED but validation might be longer running for large datasets). Express acceptable if you guarantee short validations. - -Proposed states (Amazon States Language pseudo): - -```json -{ - "Comment": "Restore Test Validation Orchestrator", - "StartAt": "Init", - - "States": { - "Init": { "Type": "Pass", "ResultPath": "$.context", "Next": "EnrichRestoreJob" }, - "EnrichRestoreJob": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:describeRestoreJob", "Parameters": { "RestoreJobId": "$.detail.restoreJobId" }, "ResultPath": "$.restoreJob", "Next": "RouteByResourceType" }, - "RouteByResourceType": { "Type": "Choice", "Choices": [ - { "Variable": "$.detail.resourceType", "StringEquals": "Aurora", "Next": "AuroraValidation" }, - { "Variable": "$.detail.resourceType", "StringEquals": "RDS", "Next": "RDSValidation" }, - { "Variable": "$.detail.resourceType", "StringEquals": "DynamoDB", "Next": "DynamoValidation" }, - { "Variable": "$.detail.resourceType", "StringEquals": "S3", "Next": "S3Validation" } - ], "Default": "GenericSkip" }, - "AuroraValidation": { "Type": "Task", "Resource": "${lambda_arn_aurora}" , "ResultPath": "$.validation", "Next": "PublishResult" }, - "RDSValidation": { "Type": "Task", "Resource": "${lambda_arn_rds}" , "ResultPath": "$.validation", "Next": "PublishResult" }, - "DynamoValidation": { "Type": "Task", "Resource": "${lambda_arn_dynamo}" , "ResultPath": "$.validation", "Next": "PublishResult" }, - "S3Validation": { "Type": "Task", "Resource": "${lambda_arn_s3}" , "ResultPath": "$.validation", "Next": "PublishResult" }, - "GenericSkip": { "Type": "Pass", "Result": { "status": "SKIPPED", 
"message": "No validator implemented for resourceType" }, "ResultPath": "$.validation", "Next": "PublishResult" }, - "PublishResult": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:putRestoreValidationResult", "Parameters": { "RestoreJobId": "$.detail.restoreJobId", "ValidationStatus": "$.validation.status", "ValidationStatusMessage": "$.validation.message" }, "Next": "EmitMetrics" }, - "EmitMetrics": { "Type": "Task", "Resource": "${lambda_arn_metrics}", "End": true } - } -} -``` - -Notes: - -- `${lambda_arn_*}` produced conditionally via Terraform based on enabled validators. -- Timeout & retry policies applied per Task (e.g. RDS 5 min, S3 2 min, Dynamo 1 min) with `Retry` blocks. -- Could collapse validators into one generic Lambda with plugin pattern. - -## 5. Extensibility Interface - -Users supply validation definitions via Terraform variables consumed by validator Lambda(s). - -### 5.1 Terraform Variables (additions) - -```hcl -variable "restore_validation_config" { - description = "Map keyed by resource type containing validation directives." 
- type = object({ - rds = optional(object({ - enabled = bool - cluster_identifiers = optional(list(string)) - sql_checks = list(object({ - database = string - statement = string - expected_rows = optional(number) - expected_hash = optional(string) # SHA256 of concatenated row values - timeout_seconds = optional(number) - })) - secret_arn = string # AWS Secrets Manager ARN for master creds or read-only - })) - dynamodb = optional(object({ - enabled = bool - tables = list(string) - checks = list(object({ - table = string - expected_item_count = optional(number) - key_sample = optional(list(object({ - pk = string - sk = optional(string) - expected_item_hash = optional(string) - }))) - })) - })) - s3 = optional(object({ - enabled = bool - buckets = list(object({ - name = string - manifest_s3_uri = optional(string) # points to authoritative manifest - sample_prefixes = optional(list(string)) - compare_object_tags = optional(bool) - })) - })) - aurora = optional(object({ - enabled = bool - clusters = list(string) - sql_checks = list(object({ - cluster_endpoint = optional(string) - database = string - statement = string - expected_rows = optional(number) - })) - secret_arn = string - })) - }) - default = {} -} -``` - - -### 5.2 Lambda Validator Contract - -All validator handlers accept unified event schema: - -```json -{ - "restoreJobId": "string", - "resourceType": "RDS|Aurora|DynamoDB|S3|...", - "createdResourceArn": "arn:aws:...", - "config": { "...resource specific config subset..." 
} -} -``` -Return object: - - -```json -{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "Human readable" } -``` - - -### 5.3 Packaging Strategy - -- Single Lambda with language (Python/Node) loads `config` JSON from SSM Parameter or encrypted file in S3 (to avoid large env variables) -- Pluggable validators registered in a dict keyed by resource type -- Optional user-provided Lambda ARN override per resource type for complete custom logic - -### 5.4 Validation Logic Patterns - -| Resource | Strategy | Failure Conditions | -|----------|----------|-------------------| -| RDS/Aurora | Execute SQL checks (each inside txn, read-only) | Query error, row count mismatch, hash mismatch, timeout | -| DynamoDB | DescribeTable + (optional) Scan limit or PartiQL key gets | Table missing, item count variance > threshold, sample hash mismatch | -| S3 | HEAD sample objects, optional compare against manifest (object key + size + etag) | Missing objects, size/etag mismatch, manifest not accessible | -| EBS (future) | (Optional) Attach test volume to temp instance and run FS metadata probe script | Attach failure, FS errors | - -## 6. Examples - -### 6.1 RDS Example Config - -```hcl -restore_validation_config = { - rds = { - enabled = true - secret_arn = aws_secretsmanager_secret.rds_ro.arn - sql_checks = [ - { database = "appdb", statement = "SELECT COUNT(*) c FROM customers", expected_rows = 1 }, - { database = "appdb", statement = "SELECT sha256(string_agg(id || ':' || status, ',' ORDER BY id)) h FROM orders", expected_hash = "abc123..." 
} - ] - } -} -``` - -### 6.2 DynamoDB Example Config - -```hcl -restore_validation_config = { - dynamodb = { - enabled = true - tables = ["orders", "customers"] - checks = [ - { table = "orders", expected_item_count = 15000 }, - { table = "customers", key_sample = [ { pk = "CUST#123", expected_item_hash = "d41d8cd98f" } ] } - ] - } -} -``` - -### 6.3 S3 Example Config - -```hcl -restore_validation_config = { - s3 = { - enabled = true - buckets = [{ - name = "images-bucket", - manifest_s3_uri = "s3://manifests-prod/images-bucket.manifest.json", - sample_prefixes = ["2025/09/", "2025/08/"] - }] - } -} -``` - -## 7. Security & Compliance - -- IAM: Validators assume dedicated role with least-privilege policies (RDS: `rds-data:ExecuteStatement` / `secretsmanager:GetSecretValue`; DynamoDB: `DescribeTable`, `GetItem`, limited `Scan` with `Limit`; S3: `HeadObject`, `GetObject` for manifest) -- Secrets: Use Secrets Manager for DB creds; do not log credentials or query data -- KMS: Encrypt Lambda environment variables, S3 manifest bucket, and Secrets Manager secret -- Network: For RDS/Aurora in private subnets, place Lambda in same VPC subnets with least required SG egress -- Auditing: Structured JSON logs (include `restoreJobId`, `resourceType`, check identifiers) -- PII Minimisation: Hash or count only; avoid selecting raw personal data rows -- Integrity of config: Optionally sign config file (S3 object with checksum validation before use) - -## 8. Operational Considerations & Cost - -- Throttle: Concurrency controls via Step Functions + reserved concurrency on validator Lambda to avoid storm after bulk restores -- Timeouts: Short per-check timeouts (e.g. 
30s; fail fast pattern) -- Retention Window: If deeper validation requires longer retention, expose `retain_hours_before_cleanup` variable (aligns with AWS restore testing retention concept) -- Metrics: Emit CloudWatch custom metrics: `ValidationSuccess`, `ValidationFailure`, `ValidationDurationMs` with dimensions `ResourceType`, `PlanName` -- Alerting: SNS topic for failures >0 in last run, or error rate > threshold across rolling period -- Cost Levers: Limit number of SQL checks; use targeted `GetItem` vs full table scans; sample S3 objects (k=20 per prefix) unless manifest diff required - -## 9. Acceptance Criteria Mapping - -| Requirement | Design Element | -|------------|----------------| -| "Ability from the blueprint to run automated test to validate restoration" | EventBridge + Step Functions + validators triggered on restore completion | -| "Test integrity of restored resource, specific to blueprint implementer" | `restore_validation_config` + per-resource plugin architecture | -| "Define an SQL query for RDS to test integrity" | `sql_checks` array with expected rows/hash support | -| "Customer responsible for defining and validating check" | User supplies Terraform variable config and (optionally) custom Lambda override | -| "Step function would just allow this functionality" | State machine orchestrates and records results via `PutRestoreValidationResult` | - -## 10. Future Enhancements - -- Add cross-account validation (restore to isolated test account, assume role back) -- Support FSx / EFS mount probing using Fargate task -- Provide Terraform module subfolder `validation` generating Step Functions + default validator Lambda -- Add canned dashboards (CloudWatch) for validation pass rate & duration - -## 11. Terraform Module Additions (Summary) - -Minimal initial scope: - -1. New optional module `aws-backup-validation` OR integrated into `aws-backup-source` behind feature flag `enable_restore_validation` -2. 
Resources: - - EventBridge rule - - Step Functions state machine (JSON from templatefile) - - IAM roles/policies (state machine + lambda) - - Validator Lambda (zip from local build or external source) - - SSM Parameter / S3 object for config JSON -3. Variables: `enable_restore_validation`, `restore_validation_config`, `custom_validator_lambda_arns` (map) -4. Outputs: `restore_validation_state_machine_arn`, `restore_validation_config_parameter_arn` - -Current prototype implementation lives in `modules/aws-backup-validation` and provides a minimal Lambda + Step Functions + EventBridge rule path. Future iterations should harden IAM scoping and expand validator logic prior to production adoption. - -## 12. Example User Flow - -1. Enable restore testing (already done with existing plan resources) -2. Set `enable_restore_validation = true` -3. Provide `restore_validation_config` with at least one resource type -4. Apply Terraform – deploys validation infra -5. Wait for scheduled restore test; Step Functions records validation results -6. View status in AWS Backup Console / CloudWatch dashboard - -## 13. Risks & Mitigations - -| Risk | Mitigation | -|------|------------| -| Long-running SQL leads to Lambda timeout | Enforce per-query timeout + limit operations (SELECT only) | -| Validator failure blocks result publishing | Wrap each validator in try/catch; on unhandled exception mark FAILED with reason | -| Sensitive data leakage in logs | Scrub query parameters and row data; log only counts + hashes | -| Drift between Terraform config and live validator config | Version config (include checksum) and log version per run | -| Excess costs from scanning large DynamoDB tables | Use item count from `DescribeTable` and targeted sample keys, avoid full scans | - -## 14. Open Questions - -- Provide managed library of validation query templates? (Out of initial scope) -- Should retention hours be explicitly configurable per selection via Terraform? 
(Potential future variable) -- Add option for concurrency-limited validation queue (SQS + Lambda) instead of Step Functions? (Future scale consideration) -
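
Note on the polling timeout in `waitForCompletion` (`src/orchestrator.ts`, PATCH 1/2): the loop allows 55 minutes, while `lambda.tf` sets the function timeout to 900 seconds, so the invocation would end before the loop's own timeout can fire. One possible adjustment, sketched here under assumptions, is to derive the polling budget from the time left in the invocation. The names `pollBudgetMs` and `maxPolls` and the 2-minute reserve are hypothetical and not part of the patch; in a real handler the remaining time would come from the Node.js Lambda context's `getRemainingTimeInMillis()`.

```typescript
// Hypothetical helpers (not part of this patch): size the restore polling loop
// from the time left in the current Lambda invocation.

// Same interval the patch waits between DescribeRestoreJob calls.
const POLL_INTERVAL_MS = 15000;
// Assumed headroom for invoking the validator and publishing the result.
const RESERVE_MS = 2 * 60 * 1000;

// Time available for polling, never negative.
function pollBudgetMs(remainingMs: number, reserveMs: number = RESERVE_MS): number {
  return Math.max(0, remainingMs - reserveMs);
}

// Number of whole polling intervals that fit into the budget.
function maxPolls(budgetMs: number, intervalMs: number = POLL_INTERVAL_MS): number {
  return Math.floor(budgetMs / intervalMs);
}

// Example: a fresh 900-second invocation.
const budget = pollBudgetMs(900 * 1000);
console.log(budget, maxPolls(budget)); // 780000 52
```

With helpers like these, `timeoutMs` in `waitForCompletion` could be set from `pollBudgetMs(context.getRemainingTimeInMillis())` instead of a fixed value, so the "did not finish within timeout" error is raised while the function can still report it.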