From 518710ca414245ee853dfdcf70ea872b18f27a68 Mon Sep 17 00:00:00 2001
From: Nick Miles <nick.miles5@nhs.net>
Date: Sat, 20 Sep 2025 01:11:41 +0100
Subject: [PATCH 1/2] ENG-893 AWS Backup Restore Testing Validation & Integrity
 Design

---
 docs/manual-restore-validation.md             | 137 ++++++++
 docs/restore-testing-design.md                | 312 ++++++++++++++++++
 examples/customer-s3-validator/index.ts       |  58 ++++
 examples/customer-s3-validator/package.json   |  16 +
 examples/customer-s3-validator/tsconfig.json  |  14 +
 examples/manual-validation/main.tf            |  83 +++++
 .../aws-backup-manual-validation/README.md    |  84 +++++
 .../dist/orchestrator.js                      | 106 ++++++
 modules/aws-backup-manual-validation/iam.tf   |  76 +++++
 .../aws-backup-manual-validation/lambda.tf    |  45 +++
 .../aws-backup-manual-validation/outputs.tf   |   4 +
 .../aws-backup-manual-validation/package.json |  20 ++
 .../src/orchestrator.ts                       | 125 +++++++
 .../tsconfig.json                             |  16 +
 .../aws-backup-manual-validation/variables.tf |  43 +++
 .../aws-backup-manual-validation/versions.tf  |   9 +
 16 files changed, 1148 insertions(+)
 create mode 100644 docs/manual-restore-validation.md
 create mode 100644 docs/restore-testing-design.md
 create mode 100644 examples/customer-s3-validator/index.ts
 create mode 100644 examples/customer-s3-validator/package.json
 create mode 100644 examples/customer-s3-validator/tsconfig.json
 create mode 100644 examples/manual-validation/main.tf
 create mode 100644 modules/aws-backup-manual-validation/README.md
 create mode 100644 modules/aws-backup-manual-validation/dist/orchestrator.js
 create mode 100644 modules/aws-backup-manual-validation/iam.tf
 create mode 100644 modules/aws-backup-manual-validation/lambda.tf
 create mode 100644 modules/aws-backup-manual-validation/outputs.tf
 create mode 100644 modules/aws-backup-manual-validation/package.json
 create mode 100644 modules/aws-backup-manual-validation/src/orchestrator.ts
 create mode 100644 modules/aws-backup-manual-validation/tsconfig.json
 create mode 100644 modules/aws-backup-manual-validation/variables.tf
 create mode 100644 modules/aws-backup-manual-validation/versions.tf

diff --git a/docs/manual-restore-validation.md b/docs/manual-restore-validation.md
new file mode 100644
index 0000000..e785bbc
--- /dev/null
+++ b/docs/manual-restore-validation.md
@@ -0,0 +1,137 @@
+# Manual Restore Validation Design
+
+## 1. Purpose
+
+Provide a lightweight, on-demand restore + validation workflow where the **customer supplies their own validation Lambda**. This complements automated restore testing plans by enabling ad-hoc integrity checks (e.g. regression assessment after a schema change, pre-cutover rehearsal) without maintaining standing orchestration state machines.
+
+## 2. Overview
+
+Flow (single resource type per invocation):
+
+1. Operator (or CI job) invokes Orchestrator Lambda with optional `recoveryPointArn`.
+2. Orchestrator chooses recovery point (latest if unspecified) from a backup vault.
+3. Starts restore job using AWS Backup `StartRestoreJob`.
+4. Polls status (`DescribeRestoreJob`) until terminal state.
+5. Invokes customer validator Lambda with contextual payload.
+6. Normalises validator response -> calls `PutRestoreValidationResult`.
+7. Returns composite result to caller (for CLI / API inspection).
+
+No Step Functions state machine is required for typical short restore + validation cycles. For long-running scenarios (beyond Lambda's 15-minute maximum timeout), Step Functions could replace in-Lambda polling.
+
+## 3. Roles & Responsibilities
+
+| Component | Responsibility |
+|-----------|----------------|
+| Orchestrator Lambda | Restore initiation, polling, validator invocation, publishing result |
+| Customer Validator Lambda | Domain/resource-specific integrity checks (S3 object presence, record counts, hashes, etc.) |
+| AWS Backup | Recovery point catalog & restore execution |
+| IAM | Enforces least privilege for restore & validation actions |
+
+## 4. Invocation Payload (Optional Fields)
+
+```jsonc
+{
+  "recoveryPointArn": "arn:aws:backup:...:recovery-point:...",  // optional override
+  "expectedKeys": ["path/example1.txt", "path/example2.txt"],   // validator-specific
+  "expectedMinObjects": 10                                       // optional fallback
+}
+```
+
+## 5. Validator Contract
+
+Input delivered to customer Lambda (superset of invocation + restore context):
+
+```json
+{
+  "restoreJobId": "...",
+  "recoveryPointArn": "...",
+  "resourceType": "S3",
+  "createdResourceArn": "arn:aws:s3:::restored-bucket",
+  "targetBucket": "restored-bucket",
+  "s3": { "bucket": "restored-bucket" },
+  "expectedKeys": ["..."],
+  "expectedMinObjects": 10
+}
+```
+
+Return:
+
+```json
+{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "summary", "details": { } }
+```
+Status mapping is case-insensitive; unknown maps to FAILED.
+
+## 6. Security Considerations
+
+- Orchestrator policy limited to listing recovery points, starting & describing restore jobs, publishing validation, invoking a single validator ARN.
+- Validator policy scoped to specific target bucket ARNs (S3 example).
+- Sensitive data avoidance: orchestrator does not log object contents, only metadata.
+- Optionally use a dedicated IAM restore role if restore requires cross-service access.
+
+## 7. Error Handling
+
+| Scenario | Behaviour |
+|----------|-----------|
+| No recovery points | Orchestrator throws error (non-validation) |
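+| Restore timeout | Error after 55m of polling (FAILED not published). A single Lambda invocation cannot run longer than 15 minutes, so the function timeout is reached first unless the polling limit is lowered |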
+| Validator throws | Orchestrator records FAILED with parse/message fallback |
+| Validator returns malformed JSON | Treated as FAILED with parse error message |
+
+## 8. Extensibility
+
+- Add multi-resource batch mode via Step Functions if needed.
+- Support additional resource types by adjusting Metadata mapping (e.g. RDS cluster restore specifics).
+- Emit custom metrics (future) for restore duration & validator latency.
+
+## 9. Example S3 Validator Patterns
+
+| Pattern | Description |
+|---------|-------------|
+| Key existence | Ensure enumerated critical objects are present (manifest-sourced) |
+| Non-empty bucket | Basic continuity signal after restore |
+| Minimum count | Validate approximate dataset size threshold |
+| Sample integrity | (Future) HEAD + ETag comparison against manifest |
+
+## 10. Terraform Surfaces
+
+Module `aws-backup-manual-validation` variables:
+
+- `backup_vault_name` (string, required)
+- `validation_lambda_arn` (string, required)
+- `resource_type` (string, e.g. S3)
+- `target_bucket_name` (string, S3 convenience)
+- `name_prefix` (string)
+
+Outputs:
+
+- `orchestrator_lambda_arn`
+
+## 11. Invocation Examples
+
+AWS CLI (invoke latest recovery point):
+
+```bash
+aws lambda invoke --cli-binary-format raw-in-base64-out --cli-read-timeout 0 \
+  --function-name myproj-dev-manual-restore-orchestrator \
+  --payload '{}' out.json && jq . out.json
+```
+
+Explicit recovery point + expected keys:
+
+```bash
+aws lambda invoke --cli-binary-format raw-in-base64-out --cli-read-timeout 0 \
+  --function-name myproj-dev-manual-restore-orchestrator \
+  --payload '{"recoveryPointArn":"arn:aws:backup:..","expectedKeys":["manifest.json","data/file1"]}' out.json
+```
+
+## 12. Limitations
+
+- Long-running restores may exceed the Lambda timeout, which is at most 15 minutes (move to Step Functions for longer or larger restores).
+- Only a single resource is restored per invocation.
+- No built-in notification channel (users can layer SNS or an EventBridge rule on top of the orchestrator's logs or invocation results).
+
+## 13. Future Enhancements
+
+- Step Functions wrapper for large parallel restores.
+- Parameter / Secrets retrieval for RDS validation credentials.
+- Config-driven validator selection registry.
diff --git a/docs/restore-testing-design.md b/docs/restore-testing-design.md
new file mode 100644
index 0000000..50b666a
--- /dev/null
+++ b/docs/restore-testing-design.md
@@ -0,0 +1,312 @@
+# AWS Backup Restore Testing Validation & Integrity Design
+
+## 1. Objectives
+
+Provide a blueprint extension that not only provisions AWS Backup Restore Testing Plans (already partially implemented via `awscc_backup_restore_testing_plan` and selections) but also validates that restored resources are *functional* and *internally consistent*. Users (blueprint implementers) define integrity checks per resource type (e.g. SQL query for RDS/Aurora, manifest verification for S3, item checks for DynamoDB) executed automatically after AWS Backup restore tests complete.
+
+## 2. High-Level Architecture
+
+![end-to-end visual of the event-driven validation workflow](diagrams/restore-validation-sequence.png)
+
+```text
+AWS Backup Restore Testing Plan (scheduled)
+        │ (runs restore jobs)
+        ▼
+Restore Test Jobs (Test restore of latest/random recovery points)
+        │ emit EventBridge events (Restore Job State Change: COMPLETED)
+        ▼
+EventBridge Rule (filters status=COMPLETED + restoreTestingPlanArn)
+        │
+        ▼
+Step Functions State Machine (or direct Lambda)  <── optional batching fan‑in
+  1. Fetch restore job details
+  2. Dispatch per resource-type validator (Lambda / Fargate / custom)
+  3. Execute user-defined integrity logic (SQL / API / S3 diff etc.)
+  4. Aggregate results
+  5. Call PutRestoreValidationResult (per restore job)
+  6. Emit metrics + SNS / EventBridge notifications
+        │
+        ▼
+CloudWatch Metrics / Logs / Alarms + Backup Console Validation Status
+```
+
+### Why Step Functions?
+
+- Orchestrates retries, parallel fan-out per restored resource
+- Standardises timeout + backoff policies
+- Simplifies conditional branching for resource types
+- Enables centralised audit trail for validation workflow
+
+A simpler single Lambda path remains possible for minimal setups; design supports either.
+
+> For an ad-hoc, customer-supplied validator workflow (manual restore + external Lambda validation without Step Functions), see `manual-restore-validation.md`.
+
+## 3. Data & Control Flows
+
+| Flow | Source → Target | Notes |
+|------|-----------------|-------|
+| A | AWS Backup → EventBridge | "Restore Job State Change" event, includes `restoreJobId`, `resourceType`, `createdResourceArn`, `restoreTestingPlanArn` |
+| B | EventBridge → Step Functions | Input filtered by plan ARN / resource types |
+| C | Step Functions → AWS Backup API | `DescribeRestoreJob` for enrichment |
+| D | Step Functions → Validator Lambdas | One per resource type OR generic dispatcher |
+| E | Validators → Target resource | Run integrity checks (SQL, scan, HEAD, etc.) |
+| F | Step Functions → AWS Backup | `PutRestoreValidationResult(ValidationStatus=SUCCESSFUL\|FAILED\|SKIPPED)`, published from the validator's returned status |
+| G | Step Functions → CloudWatch / SNS | Emit metrics, structured JSON log, optional alert |
+
+## 4. State Machine Definition (Express or Standard)
+
+Recommended: **Standard**. Restores may take hours, and although validation only starts after the restore job reaches COMPLETED, validating large datasets can itself be long-running. Express workflows are capped at five minutes, so they are acceptable only if validations are guaranteed to be short.
+
+Proposed states (Amazon States Language pseudo):
+
+```json
+{
+  "Comment": "Restore Test Validation Orchestrator",
+  "StartAt": "Init",
+
+  "States": {
+    "Init": { "Type": "Pass", "ResultPath": "$.context", "Next": "EnrichRestoreJob" },
+    "EnrichRestoreJob": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:describeRestoreJob", "Parameters": { "RestoreJobId.$": "$.detail.restoreJobId" }, "ResultPath": "$.restoreJob", "Next": "RouteByResourceType" },
+    "RouteByResourceType": { "Type": "Choice", "Choices": [
+        { "Variable": "$.detail.resourceType", "StringEquals": "Aurora", "Next": "AuroraValidation" },
+        { "Variable": "$.detail.resourceType", "StringEquals": "RDS", "Next": "RDSValidation" },
+        { "Variable": "$.detail.resourceType", "StringEquals": "DynamoDB", "Next": "DynamoValidation" },
+        { "Variable": "$.detail.resourceType", "StringEquals": "S3", "Next": "S3Validation" }
+      ], "Default": "GenericSkip" },
+    "AuroraValidation": { "Type": "Task", "Resource": "${lambda_arn_aurora}" , "ResultPath": "$.validation", "Next": "PublishResult" },
+    "RDSValidation": { "Type": "Task", "Resource": "${lambda_arn_rds}" , "ResultPath": "$.validation", "Next": "PublishResult" },
+    "DynamoValidation": { "Type": "Task", "Resource": "${lambda_arn_dynamo}" , "ResultPath": "$.validation", "Next": "PublishResult" },
+    "S3Validation": { "Type": "Task", "Resource": "${lambda_arn_s3}" , "ResultPath": "$.validation", "Next": "PublishResult" },
+    "GenericSkip": { "Type": "Pass", "Result": { "status": "SKIPPED", "message": "No validator implemented for resourceType" }, "ResultPath": "$.validation", "Next": "PublishResult" },
+    "PublishResult": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:putRestoreValidationResult", "Parameters": { "RestoreJobId.$": "$.detail.restoreJobId", "ValidationStatus.$": "$.validation.status", "ValidationStatusMessage.$": "$.validation.message" }, "ResultPath": null, "Next": "EmitMetrics" },
+    "EmitMetrics": { "Type": "Task", "Resource": "${lambda_arn_metrics}", "End": true }
+  }
+}
+```
+
+Notes:
+
+- `${lambda_arn_*}` produced conditionally via Terraform based on enabled validators.
+- Timeout & retry policies applied per Task (e.g. RDS 5 min, S3 2 min, Dynamo 1 min) with `Retry` blocks.
+- Could collapse validators into one generic Lambda with plugin pattern.
+
+## 5. Extensibility Interface
+
+Users supply validation definitions via Terraform variables consumed by validator Lambda(s).
+
+### 5.1 Terraform Variables (additions)
+
+```hcl
+variable "restore_validation_config" {
+  description = "Object with one optional attribute per resource type, each containing validation directives."
+  type = object({
+    rds = optional(object({
+      enabled          = bool
+      cluster_identifiers = optional(list(string))
+      sql_checks = list(object({
+        database = string
+        statement = string
+        expected_rows = optional(number)
+        expected_hash = optional(string) # SHA256 of concatenated row values
+        timeout_seconds = optional(number)
+      }))
+      secret_arn = string # AWS Secrets Manager ARN for master creds or read-only
+    }))
+    dynamodb = optional(object({
+      enabled = bool
+      tables  = list(string)
+      checks = list(object({
+        table        = string
+        expected_item_count = optional(number)
+        key_sample = optional(list(object({
+          pk = string
+          sk = optional(string)
+          expected_item_hash = optional(string)
+        })))
+      }))
+    }))
+    s3 = optional(object({
+      enabled = bool
+      buckets = list(object({
+        name = string
+        manifest_s3_uri = optional(string) # points to authoritative manifest
+        sample_prefixes = optional(list(string))
+        compare_object_tags = optional(bool)
+      }))
+    }))
+    aurora = optional(object({
+      enabled   = bool
+      clusters  = list(string)
+      sql_checks = list(object({
+        cluster_endpoint = optional(string)
+        database  = string
+        statement = string
+        expected_rows = optional(number)
+      }))
+      secret_arn = string
+    }))
+  })
+  default = {}
+}
+```
+
+
+### 5.2 Lambda Validator Contract
+
+All validator handlers accept unified event schema:
+
+```json
+{
+  "restoreJobId": "string",
+  "resourceType": "RDS|Aurora|DynamoDB|S3|...",
+  "createdResourceArn": "arn:aws:...",
+  "config": { "...resource specific config subset..." }
+}
+```
+Return object:
+
+
+```json
+{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "Human readable" }
+```
+
+
+### 5.3 Packaging Strategy
+
+- A single Lambda (Python or Node.js) loads the `config` JSON from an SSM parameter or an encrypted S3 object (to avoid large environment variables)
+- Pluggable validators are registered in a dictionary keyed by resource type
+- An optional user-provided Lambda ARN per resource type overrides the built-in validator for fully custom logic
+
+### 5.4 Validation Logic Patterns
+
+| Resource | Strategy | Failure Conditions |
+|----------|----------|-------------------|
+| RDS/Aurora | Execute SQL checks (each inside a read-only transaction) | Query error, row count mismatch, hash mismatch, timeout |
+| DynamoDB | DescribeTable + (optional) Scan limit or PartiQL key gets | Table missing, item count variance > threshold, sample hash mismatch |
+| S3 | HEAD sample objects, optional compare against manifest (object key + size + etag) | Missing objects, size/etag mismatch, manifest not accessible |
+| EBS (future) | (Optional) Attach test volume to temp instance and run FS metadata probe script | Attach failure, FS errors |
+
+## 6. Examples
+
+### 6.1 RDS Example Config
+
+```hcl
+restore_validation_config = {
+  rds = {
+    enabled = true
+    secret_arn = aws_secretsmanager_secret.rds_ro.arn
+    sql_checks = [
+      { database = "appdb", statement = "SELECT COUNT(*) c FROM customers", expected_rows = 1 },
+      { database = "appdb", statement = "SELECT sha256(string_agg(id || ':' || status, ',' ORDER BY id)) h FROM orders", expected_hash = "abc123..." }
+    ]
+  }
+}
+```
+
+### 6.2 DynamoDB Example Config
+
+```hcl
+restore_validation_config = {
+  dynamodb = {
+    enabled = true
+    tables = ["orders", "customers"]
+    checks = [
+      { table = "orders", expected_item_count = 15000 },
+      { table = "customers", key_sample = [ { pk = "CUST#123", expected_item_hash = "d41d8cd98f" } ] }
+    ]
+  }
+}
+```
+
+### 6.3 S3 Example Config
+
+```hcl
+restore_validation_config = {
+  s3 = {
+    enabled = true
+    buckets = [{
+      name = "images-bucket",
+      manifest_s3_uri = "s3://manifests-prod/images-bucket.manifest.json",
+      sample_prefixes = ["2025/09/", "2025/08/"]
+    }]
+  }
+}
+```
+
+## 7. Security & Compliance
+
+- IAM: Validators assume a dedicated role with least-privilege policies (RDS: `rds-data:ExecuteStatement` / `secretsmanager:GetSecretValue`; DynamoDB: `dynamodb:DescribeTable`, `dynamodb:GetItem`, limited `dynamodb:Scan` with `Limit`; S3: `s3:GetObject`, which also authorises `HeadObject` requests and manifest reads)
+- Secrets: Use Secrets Manager for DB creds; do not log credentials or query data
+- KMS: Encrypt Lambda environment variables, S3 manifest bucket, and Secrets Manager secret
+- Network: For RDS/Aurora in private subnets, place Lambda in same VPC subnets with least required SG egress
+- Auditing: Structured JSON logs (include `restoreJobId`, `resourceType`, check identifiers)
+- PII Minimisation: Hash or count only; avoid selecting raw personal data rows
+- Integrity of config: Optionally sign config file (S3 object with checksum validation before use)
+
+## 8. Operational Considerations & Cost
+
+- Throttle: Concurrency controls via Step Functions + reserved concurrency on validator Lambda to avoid storm after bulk restores
+- Timeouts: Short per-check timeouts (e.g. 30s; fail fast pattern)
+- Retention Window: If deeper validation requires longer retention, expose `retain_hours_before_cleanup` variable (aligns with AWS restore testing retention concept)
+- Metrics: Emit CloudWatch custom metrics: `ValidationSuccess`, `ValidationFailure`, `ValidationDurationMs` with dimensions `ResourceType`, `PlanName`
+- Alerting: SNS topic for failures >0 in last run, or error rate > threshold across rolling period
+- Cost Levers: Limit number of SQL checks; use targeted `GetItem` vs full table scans; sample S3 objects (k=20 per prefix) unless manifest diff required
+
+## 9. Acceptance Criteria Mapping
+
+| Requirement | Design Element |
+|------------|----------------|
+| "Ability from the blueprint to run automated test to validate restoration" | EventBridge + Step Functions + validators triggered on restore completion |
+| "Test integrity of restored resource, specific to blueprint implementer" | `restore_validation_config` + per-resource plugin architecture |
+| "Define an SQL query for RDS to test integrity" | `sql_checks` array with expected rows/hash support |
+| "Customer responsible for defining and validating check" | User supplies Terraform variable config and (optionally) custom Lambda override |
+| "Step function would just allow this functionality" | State machine orchestrates and records results via `PutRestoreValidationResult` |
+
+## 10. Future Enhancements
+
+- Add cross-account validation (restore to isolated test account, assume role back)
+- Support FSx / EFS mount probing using Fargate task
+- Provide Terraform module subfolder `validation` generating Step Functions + default validator Lambda
+- Add canned dashboards (CloudWatch) for validation pass rate & duration
+
+## 11. Terraform Module Additions (Summary)
+
+Minimal initial scope:
+
+1. New optional module `aws-backup-validation` OR integrated into `aws-backup-source` behind feature flag `enable_restore_validation`
+2. Resources:
+   - EventBridge rule
+   - Step Functions state machine (JSON from templatefile)
+   - IAM roles/policies (state machine + lambda)
+   - Validator Lambda (zip from local build or external source)
+   - SSM Parameter / S3 object for config JSON
+3. Variables: `enable_restore_validation`, `restore_validation_config`, `custom_validator_lambda_arns` (map)
+4. Outputs: `restore_validation_state_machine_arn`, `restore_validation_config_parameter_arn`
+
+Current prototype implementation lives in `modules/aws-backup-validation` and provides a minimal Lambda + Step Functions + EventBridge rule path. Future iterations should harden IAM scoping and expand validator logic prior to production adoption.
+
+## 12. Example User Flow
+
+1. Enable restore testing (already done with existing plan resources)
+2. Set `enable_restore_validation = true`
+3. Provide `restore_validation_config` with at least one resource type
+4. Apply Terraform – deploys validation infra
+5. Wait for scheduled restore test; Step Functions records validation results
+6. View status in AWS Backup Console / CloudWatch dashboard
+
+## 13. Risks & Mitigations
+
+| Risk | Mitigation |
+|------|------------|
+| Long-running SQL leads to Lambda timeout | Enforce per-query timeout + limit operations (SELECT only) |
+| Validator failure blocks result publishing | Wrap each validator in try/catch; on unhandled exception mark FAILED with reason |
+| Sensitive data leakage in logs | Scrub query parameters and row data; log only counts + hashes |
+| Drift between Terraform config and live validator config | Version config (include checksum) and log version per run |
+| Excess costs from scanning large DynamoDB tables | Use item count from `DescribeTable` and targeted sample keys, avoid full scans |
+
+## 14. Open Questions
+
+- Provide managed library of validation query templates? (Out of initial scope)
+- Should retention hours be explicitly configurable per selection via Terraform? (Potential future variable)
+- Add option for concurrency-limited validation queue (SQS + Lambda) instead of Step Functions? (Future scale consideration)
+
diff --git a/examples/customer-s3-validator/index.ts b/examples/customer-s3-validator/index.ts
new file mode 100644
index 0000000..9192687
--- /dev/null
+++ b/examples/customer-s3-validator/index.ts
@@ -0,0 +1,58 @@
+import { S3Client, HeadObjectCommand, ListObjectsV2Command } from "@aws-sdk/client-s3";
+
+const s3 = new S3Client({});
+
+/* Example validator strategy:
+   1. If event.expectedKeys provided -> verify each exists.
+   2. Else if event.s3.bucket provided -> ensure bucket contains at least one object (or expectedMinObjects).
+   Return status + message summarising findings.
+*/
+
+interface EventShape {
+  restoreJobId: string;
+  recoveryPointArn: string;
+  resourceType: string;
+  createdResourceArn?: string;
+  targetBucket?: string;
+  s3?: { bucket?: string };
+  expectedKeys?: string[];
+  expectedMinObjects?: number;
+}
+
+export const handler = async (event: EventShape) => {
+  const bucket = event.targetBucket || event.s3?.bucket;
+  if (!bucket) {
+    return { status: "SKIPPED", message: "No bucket specified" };
+  }
+
+  if (event.expectedKeys && event.expectedKeys.length > 0) {
+    const missing: string[] = [];
+    for (const key of event.expectedKeys) {
+      try {
+        await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
+      } catch (e) {
+        missing.push(key); // NOTE: any error (e.g. AccessDenied), not only NotFound, is counted as missing
+      }
+    }
+    if (missing.length > 0) {
+      return { status: "FAILED", message: `Missing ${missing.length} objects`, details: { missing } };
+    }
+    return { status: "SUCCESSFUL", message: `All ${event.expectedKeys.length} expected objects present` };
+  }
+
+  // Fallback: simple non-empty check or min object threshold
+  const min = event.expectedMinObjects ?? 1;
+  let found = 0;
+  let ContinuationToken: string | undefined = undefined;
+  while (found < min) {
+    const resp = await s3.send(new ListObjectsV2Command({ Bucket: bucket, MaxKeys: 1000, ContinuationToken }));
+    const count = resp.Contents?.length || 0;
+    found += count;
+    if (!resp.IsTruncated) break;
+    ContinuationToken = resp.NextContinuationToken;
+  }
+  if (found < min) {
+    return { status: "FAILED", message: `Only ${found} objects found (< ${min})` };
+  }
+  return { status: "SUCCESSFUL", message: `Found ${found} objects (>= ${min})` };
+};
diff --git a/examples/customer-s3-validator/package.json b/examples/customer-s3-validator/package.json
new file mode 100644
index 0000000..0761b10
--- /dev/null
+++ b/examples/customer-s3-validator/package.json
@@ -0,0 +1,16 @@
+{
+  "name": "customer-s3-validator-example",
+  "version": "0.1.0",
+  "private": true,
+  "type": "module",
+  "scripts": {
+    "build": "tsc -p tsconfig.json"
+  },
+  "dependencies": {
+    "@aws-sdk/client-s3": "^3.637.0"
+  },
+  "devDependencies": {
+    "typescript": "^5.4.0",
+    "@types/node": "^20.11.0"
+  }
+}
diff --git a/examples/customer-s3-validator/tsconfig.json b/examples/customer-s3-validator/tsconfig.json
new file mode 100644
index 0000000..8cb4dd2
--- /dev/null
+++ b/examples/customer-s3-validator/tsconfig.json
@@ -0,0 +1,14 @@
+{
+  "compilerOptions": {
+    "target": "ES2020",
+    "module": "ES2020",
+    "moduleResolution": "Node",
+    "outDir": "dist",
+    "rootDir": ".",
+    "esModuleInterop": true,
+    "strict": true,
+    "skipLibCheck": true
+  },
+  "include": ["index.ts"],
+  "exclude": ["node_modules"]
+}
diff --git a/examples/manual-validation/main.tf b/examples/manual-validation/main.tf
new file mode 100644
index 0000000..5210885
--- /dev/null
+++ b/examples/manual-validation/main.tf
@@ -0,0 +1,83 @@
+terraform {
+  required_version = ">= 1.5.0"
+  required_providers {
+    aws = {
+      source  = "hashicorp/aws"
+      version = ">= 5.0"
+    }
+  }
+}
+
+provider "aws" {
+  region = var.region
+}
+
+variable "region" { type = string }
+variable "name_prefix" { type = string }
+variable "backup_vault_name" { type = string }
+variable "restore_bucket" { type = string }
+
+# Example customer validator lambda (upload dist bundle manually or integrate build pipeline).
+resource "aws_lambda_function" "customer_validator" {
+  function_name    = "${var.name_prefix}-customer-s3-validator"
+  role             = aws_iam_role.customer_validator.arn
+  handler          = "index.handler"
+  runtime          = "nodejs20.x"
+  filename         = "./lambda_customer_validator.zip" # user supplied artifact
+  source_code_hash = filebase64sha256("./lambda_customer_validator.zip")
+  timeout          = 60
+  environment {
+    variables = {}
+  }
+}
+
+resource "aws_iam_role" "customer_validator" {
+  name = "${var.name_prefix}-customer-s3-validator-role"
+  assume_role_policy = jsonencode({
+    Version = "2012-10-17"
+    Statement = [{
+      Effect    = "Allow"
+      Action    = "sts:AssumeRole"
+      Principal = { Service = "lambda.amazonaws.com" }
+    }]
+  })
+}
+
+resource "aws_iam_role_policy_attachment" "logs_attach_customer" {
+  role       = aws_iam_role.customer_validator.name
+  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
+}
+
+resource "aws_iam_policy" "customer_s3_policy" {
+  name   = "${var.name_prefix}-customer-s3-validator-policy"
+  policy = jsonencode({
+    Version = "2012-10-17"
+    Statement = [
+      {
+        Effect   = "Allow"
+        Action   = ["s3:ListBucket", "s3:GetObject"] # HeadObject requests are authorised by s3:GetObject
+        Resource = [
+          "arn:aws:s3:::${var.restore_bucket}",
+          "arn:aws:s3:::${var.restore_bucket}/*"
+        ]
+      }
+    ]
+  })
+}
+
+resource "aws_iam_role_policy_attachment" "customer_validator_attach" {
+  role       = aws_iam_role.customer_validator.name
+  policy_arn = aws_iam_policy.customer_s3_policy.arn
+}
+
+module "manual_validation" {
+  source                = "../../modules/aws-backup-manual-validation"
+  enable                = true
+  name_prefix           = var.name_prefix
+  backup_vault_name     = var.backup_vault_name
+  resource_type         = "S3"
+  validation_lambda_arn = aws_lambda_function.customer_validator.arn
+  target_bucket_name    = var.restore_bucket
+}
+
+output "orchestrator_lambda" { value = module.manual_validation.orchestrator_lambda_arn }
diff --git a/modules/aws-backup-manual-validation/README.md b/modules/aws-backup-manual-validation/README.md
new file mode 100644
index 0000000..4ec158c
--- /dev/null
+++ b/modules/aws-backup-manual-validation/README.md
@@ -0,0 +1,84 @@
+# AWS Backup Manual Restore Validation Module
+
+Provides an on-demand Lambda **orchestrator** that:
+
+1. Selects a recovery point (latest by default) from a specified backup vault.
+2. Starts a restore job for the chosen recovery point (supports S3 in example).
+3. Waits for restore job completion (polling AWS Backup).
+4. Invokes a **customer-provided validation Lambda** (you own resource-specific logic).
+5. Publishes validation status back to AWS Backup using `PutRestoreValidationResult`.
+
+This pattern differs from automated restore testing plans: it is **manually triggered** (e.g. via `aws lambda invoke` or an API Gateway front-end) and delegates validation logic entirely to a customer-maintained Lambda.
+
+## Key Design Principles
+
+- **Separation of concerns**: Orchestrator handles restore lifecycle & result publishing; customer Lambda handles semantic integrity checks.
+- **Pluggable**: Any runtime or language for validator (only contract is JSON in/out).
+- **Minimal surface**: No Step Functions required for single-resource manual validation.
+
+## Orchestrator Environment Variables
+
+| Variable | Purpose |
+|----------|---------|
+| `BACKUP_VAULT_NAME` | Source vault to enumerate recovery points |
+| `RESOURCE_TYPE` | Backup resource type (e.g. `S3`) |
+| `VALIDATOR_LAMBDA` | ARN of customer validator Lambda |
+| `TARGET_BUCKET` | (S3 only) Destination bucket name to validate |
+| `RESTORE_ROLE_ARN` | (Optional) IAM role used for restore job |
+
+## Customer Validator Contract
+
+**Invocation Payload** (example):
+
+```json
+{
+  "restoreJobId": "1234abcd",
+  "recoveryPointArn": "arn:aws:backup:...:recovery-point:...",
+  "resourceType": "S3",
+  "createdResourceArn": "arn:aws:s3:::restored-bucket",
+  "targetBucket": "restored-bucket",
+  "s3": { "bucket": "restored-bucket" }
+}
+```
+
+**Return Object**:
+
+```json
+{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "Human readable summary" }
+```
+Statuses are normalised by the orchestrator before calling `PutRestoreValidationResult`.
+
+## Terraform Inputs
+
+See `variables.tf` for full list. Essential:
+
+```hcl
+module "manual_validation" {
+  source                = "../modules/aws-backup-manual-validation"
+  enable                = true
+  name_prefix           = var.name_prefix
+  backup_vault_name     = var.backup_vault_name
+  resource_type         = "S3"
+  validation_lambda_arn = aws_lambda_function.customer_validator.arn
+  target_bucket_name    = var.target_restore_bucket
+}
+```
+
+## Example Validator (S3 Presence / Count)
+
+See `../../examples/customer-s3-validator` for a full TypeScript implementation that checks a set of expected keys or, failing that, lists the bucket to confirm the restore is non-empty.
+
+## Operational Notes
+
+- Timeouts: The orchestrator Lambda's default timeout is 15 minutes, which is also the Lambda maximum; long restores will exceed this, so use small test datasets or adapt to Step Functions if needed.
+- Costs: Avoid listing millions of S3 keys in the validator; prefer sampling.
+- IAM Hardening: The current policy grants a small set of `backup:` actions and S3 read actions on all resources (`*`); scope them to specific ARNs in production.
+
+## Future Enhancements
+
+- Document and expose explicit recovery point selection; the handler already accepts `event.recoveryPointArn` in place of the automatic pick.
+- Emit custom CloudWatch metrics for validation duration & success rate.
+- Optional SNS notification on failure.
+
+---
+MIT-style licensing, per repository policy.
diff --git a/modules/aws-backup-manual-validation/dist/orchestrator.js b/modules/aws-backup-manual-validation/dist/orchestrator.js
new file mode 100644
index 0000000..9c19246
--- /dev/null
+++ b/modules/aws-backup-manual-validation/dist/orchestrator.js
@@ -0,0 +1,106 @@
+import { BackupClient, ListRecoveryPointsByBackupVaultCommand, StartRestoreJobCommand, DescribeRestoreJobCommand, PutRestoreValidationResultCommand } from "@aws-sdk/client-backup";
+import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";
+import { S3Client } from "@aws-sdk/client-s3";
+const backup = new BackupClient({});
+const lambda = new LambdaClient({});
+const s3 = new S3Client({});
+const BACKUP_VAULT_NAME = process.env.BACKUP_VAULT_NAME;
+const RESOURCE_TYPE = process.env.RESOURCE_TYPE; // e.g. S3
+const VALIDATOR_LAMBDA = process.env.VALIDATOR_LAMBDA;
+const TARGET_BUCKET = process.env.TARGET_BUCKET; // optional S3 bucket
+export const handler = async (event = {}) => {
+  console.log(JSON.stringify({ msg: "Manual restore orchestration start", event }));
+  const recoveryPointArn = event.recoveryPointArn || await pickLatestRecoveryPoint();
+  console.log({ recoveryPointArn });
+  const restoreJobId = await startRestore(recoveryPointArn);
+  console.log({ restoreJobId });
+  const restoreDesc = await waitForCompletion(restoreJobId);
+  console.log({ restoreDesc });
+  const validatorPayload = {
+    restoreJobId,
+    recoveryPointArn,
+    resourceType: RESOURCE_TYPE,
+    createdResourceArn: restoreDesc.CreatedResourceArn,
+    targetBucket: TARGET_BUCKET,
+    s3: { bucket: TARGET_BUCKET }
+  };
+  const validationResult = await invokeValidator(validatorPayload);
+  console.log({ validationResult });
+  await publishValidation(restoreJobId, validationResult);
+  return {
+    restoreJobId,
+    recoveryPointArn,
+    validation: validationResult
+  };
+};
+async function pickLatestRecoveryPoint() {
+  const cmd = new ListRecoveryPointsByBackupVaultCommand({ BackupVaultName: BACKUP_VAULT_NAME, ByResourceType: RESOURCE_TYPE, MaxResults: 20 });
+  const resp = await backup.send(cmd);
+  if (!resp.RecoveryPoints || resp.RecoveryPoints.length === 0) {
+    throw new Error("No recovery points found in vault");
+  }
+  const sorted = [...resp.RecoveryPoints].sort((a, b) => (b.CreationDate?.getTime() || 0) - (a.CreationDate?.getTime() || 0));
+  return sorted[0].RecoveryPointArn;
+}
+async function startRestore(recoveryPointArn) {
+  const cmd = new StartRestoreJobCommand({
+    RecoveryPointArn: recoveryPointArn,
+    IamRoleArn: process.env.RESTORE_ROLE_ARN,
+    ResourceType: RESOURCE_TYPE,
+    Metadata: TARGET_BUCKET ? { destinationBucketName: TARGET_BUCKET } : {}
+  });
+  const resp = await backup.send(cmd);
+  if (!resp.RestoreJobId)
+    throw new Error("StartRestoreJob returned no RestoreJobId");
+  return resp.RestoreJobId;
+}
+async function waitForCompletion(restoreJobId) {
+  const timeoutMs = 1000 * 60 * 13; // below the 900 s Lambda timeout, so the error below can actually fire
+  const start = Date.now();
+  while (Date.now() - start < timeoutMs) {
+    const desc = await backup.send(new DescribeRestoreJobCommand({ RestoreJobId: restoreJobId }));
+    if (desc.Status === "COMPLETED" || desc.Status === "ABORTED" || desc.Status === "FAILED") {
+      return desc;
+    }
+    await new Promise(r => setTimeout(r, 15000));
+  }
+  throw new Error("Restore job did not finish within timeout");
+}
+async function invokeValidator(payload) {
+  const cmd = new InvokeCommand({
+    FunctionName: VALIDATOR_LAMBDA,
+    InvocationType: "RequestResponse",
+    Payload: Buffer.from(JSON.stringify(payload))
+  });
+  const resp = await lambda.send(cmd);
+  if (!resp.Payload)
+    throw new Error("Validator returned no payload");
+  const txt = Buffer.from(resp.Payload).toString("utf-8");
+  try {
+    return JSON.parse(txt);
+  } catch (e) {
+    throw new Error("Validator payload JSON parse error: " + txt);
+  }
+}
+async function publishValidation(restoreJobId, result) {
+  const status = mapStatus(result.status);
+  const message = (result.message || "").slice(0, 1000);
+  const cmd = new PutRestoreValidationResultCommand({
+    RestoreJobId: restoreJobId,
+    ValidationStatus: status,
+    ValidationStatusMessage: message
+  });
+  await backup.send(cmd);
+}
+function mapStatus(s) {
+  if (!s)
+    return "FAILED";
+  const upper = s.toUpperCase();
+  if (["SUCCESS", "SUCCESSFUL", "OK"].includes(upper))
+    return "SUCCESSFUL";
+  if (["FAILED", "FAIL", "ERROR"].includes(upper))
+    return "FAILED";
+  if (["SKIPPED", "IGNORE", "IGNORED"].includes(upper))
+    return "SKIPPED";
+  return "FAILED";
+}
diff --git a/modules/aws-backup-manual-validation/iam.tf b/modules/aws-backup-manual-validation/iam.tf
new file mode 100644
index 0000000..1601dc3
--- /dev/null
+++ b/modules/aws-backup-manual-validation/iam.tf
@@ -0,0 +1,76 @@
+locals {
+  manual_validation_name = "${var.name_prefix}-manual-restore-validation"
+}
+
+resource "aws_iam_role" "orchestrator" {
+  count              = var.enable ? 1 : 0
+  name               = "${local.manual_validation_name}-orchestrator"
+  assume_role_policy = data.aws_iam_policy_document.orchestrator_assume.json
+}
+
+data "aws_iam_policy_document" "orchestrator_assume" {
+  statement {
+    actions = ["sts:AssumeRole"]
+    principals {
+      type        = "Service"
+      identifiers = ["lambda.amazonaws.com"]
+    }
+  }
+}
+
+# NOTE: Resources are intentionally broad placeholders ("*") and should be tightened.
+# Covers: listing recovery points, starting and describing restore jobs, publishing
+# validation results, invoking the customer validator Lambda, writing logs, optional S3 read.
+
+data "aws_iam_policy_document" "orchestrator" {
+  statement {
+    sid = "Logs"
+    actions = [
+      "logs:CreateLogGroup",
+      "logs:CreateLogStream",
+      "logs:PutLogEvents"
+    ]
+    resources = ["*"]
+  }
+
+  statement {
+    sid = "BackupCore"
+    actions = [
+      "backup:ListRecoveryPointsByBackupVault",
+      "backup:StartRestoreJob",
+      "backup:DescribeRestoreJob",
+      "backup:PutRestoreValidationResult"
+    ]
+    resources = ["*"]
+  }
+
+  statement {
+    sid = "InvokeValidator"
+    actions = [
+      "lambda:InvokeFunction"
+    ]
+    resources = [var.validation_lambda_arn]
+  }
+
+  statement {
+    sid = "S3ReadOptional"
+    actions = [
+      "s3:ListBucket",
+      # HeadObject requests are authorised by s3:GetObject; "s3:HeadObject" is not an IAM action.
+      "s3:GetObject"
+    ]
+    resources = ["*"]
+  }
+}
+
+resource "aws_iam_policy" "orchestrator" {
+  count  = var.enable ? 1 : 0
+  name   = "${local.manual_validation_name}-policy"
+  policy = data.aws_iam_policy_document.orchestrator.json
+}
+
+resource "aws_iam_role_policy_attachment" "orchestrator" {
+  count      = var.enable ? 1 : 0
+  role       = aws_iam_role.orchestrator[0].name
+  policy_arn = aws_iam_policy.orchestrator[0].arn
+}
diff --git a/modules/aws-backup-manual-validation/lambda.tf b/modules/aws-backup-manual-validation/lambda.tf
new file mode 100644
index 0000000..58d6de6
--- /dev/null
+++ b/modules/aws-backup-manual-validation/lambda.tf
@@ -0,0 +1,45 @@
+locals {
+  orchestrator_src_dir = "${path.module}/src"
+}
+
+resource "aws_cloudwatch_log_group" "orchestrator" {
+  count             = var.enable ? 1 : 0
+  name              = "/aws/lambda/${aws_lambda_function.orchestrator[0].function_name}"
+  retention_in_days = 30
+}
+
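+# The pre-built dist/orchestrator.js is zipped as-is; rebuild it with `npm run build` after changing src/.
+# NOTE: the file uses ES module imports, so the archive needs a `.mjs` name or a package.json with "type": "module" to load under the Node.js runtime.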
+
+data "archive_file" "orchestrator" {
+  type        = "zip"
+  source_file = "${path.module}/dist/orchestrator.js"
+  output_path = "${path.module}/dist/orchestrator.zip"
+}
+
+resource "aws_lambda_function" "orchestrator" {
+  count            = var.enable ? 1 : 0
+  function_name    = "${var.name_prefix}-manual-restore-orchestrator"
+  role             = aws_iam_role.orchestrator[0].arn
+  handler          = "orchestrator.handler"
+  runtime          = "nodejs20.x"
+  filename         = data.archive_file.orchestrator.output_path
+  source_code_hash = data.archive_file.orchestrator.output_base64sha256
+  timeout          = 900
+  memory_size      = 256
+
+  environment {
+    variables = {
+      BACKUP_VAULT_NAME = var.backup_vault_name
+      RESOURCE_TYPE     = var.resource_type
+      VALIDATOR_LAMBDA  = var.validation_lambda_arn
+      TARGET_BUCKET     = var.target_bucket_name
+    }
+  }
+  tags = var.tags
+}
+
+output "manual_restore_orchestrator_lambda_arn" {
+  value       = try(aws_lambda_function.orchestrator[0].arn, null)
+  description = "ARN of the manual restore orchestrator lambda"
+}
diff --git a/modules/aws-backup-manual-validation/outputs.tf b/modules/aws-backup-manual-validation/outputs.tf
new file mode 100644
index 0000000..6cc5e34
--- /dev/null
+++ b/modules/aws-backup-manual-validation/outputs.tf
@@ -0,0 +1,4 @@
+output "orchestrator_lambda_arn" {
+  value       = try(aws_lambda_function.orchestrator[0].arn, null)
+  description = "Manual restore validation orchestrator Lambda ARN"
+}
diff --git a/modules/aws-backup-manual-validation/package.json b/modules/aws-backup-manual-validation/package.json
new file mode 100644
index 0000000..fc5555b
--- /dev/null
+++ b/modules/aws-backup-manual-validation/package.json
@@ -0,0 +1,20 @@
+{
+  "name": "aws-backup-manual-validation-orchestrator",
+  "version": "0.1.0",
+  "private": true,
+  "type": "module",
+  "scripts": {
+    "build": "tsc --project tsconfig.json",
+    "clean": "rimraf dist"
+  },
+  "dependencies": {
+    "@aws-sdk/client-backup": "^3.637.0",
+    "@aws-sdk/client-lambda": "^3.637.0",
+    "@aws-sdk/client-s3": "^3.637.0"
+  },
+  "devDependencies": {
+    "typescript": "^5.4.0",
+    "@types/node": "^20.11.0",
+    "rimraf": "^5.0.5"
+  }
+}
diff --git a/modules/aws-backup-manual-validation/src/orchestrator.ts b/modules/aws-backup-manual-validation/src/orchestrator.ts
new file mode 100644
index 0000000..eda153e
--- /dev/null
+++ b/modules/aws-backup-manual-validation/src/orchestrator.ts
@@ -0,0 +1,125 @@
+/* Orchestrator Lambda (TypeScript)
+   Triggers a manual restore job for a chosen recovery point and invokes a customer-provided validation Lambda.
+   The customer Lambda should return JSON: { status: "SUCCESSFUL|FAILED|SKIPPED", message: string }
+*/
+import { BackupClient, ListRecoveryPointsByBackupVaultCommand, StartRestoreJobCommand, DescribeRestoreJobCommand, PutRestoreValidationResultCommand } from "@aws-sdk/client-backup";
+import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";
+import { S3Client, HeadObjectCommand } from "@aws-sdk/client-s3";
+
+const backup = new BackupClient({});
+const lambda = new LambdaClient({});
+const s3 = new S3Client({});
+
+const BACKUP_VAULT_NAME = process.env.BACKUP_VAULT_NAME!;
+const RESOURCE_TYPE = process.env.RESOURCE_TYPE!; // e.g. S3
+const VALIDATOR_LAMBDA = process.env.VALIDATOR_LAMBDA!;
+const TARGET_BUCKET = process.env.TARGET_BUCKET; // optional S3 bucket
+
+interface ValidatorResult { status: string; message?: string; [k: string]: any }
+
+export const handler = async (event: any = {}): Promise<any> => {
+  console.log(JSON.stringify({ msg: "Manual restore orchestration start", event }));
+
+  const recoveryPointArn = event.recoveryPointArn || await pickLatestRecoveryPoint();
+  console.log({ recoveryPointArn });
+
+  const restoreJobId = await startRestore(recoveryPointArn);
+  console.log({ restoreJobId });
+
+  const restoreDesc = await waitForCompletion(restoreJobId);
+  console.log({ restoreDesc });
+
+  const validatorPayload = {
+    restoreJobId,
+    recoveryPointArn,
+    resourceType: RESOURCE_TYPE,
+    createdResourceArn: restoreDesc.CreatedResourceArn,
+    targetBucket: TARGET_BUCKET,
+    // Additional S3 example context the customer validator might use:
+    s3: { bucket: TARGET_BUCKET }
+  };
+
+  const validationResult = await invokeValidator(validatorPayload);
+  console.log({ validationResult });
+
+  await publishValidation(restoreJobId, validationResult);
+
+  return {
+    restoreJobId,
+    recoveryPointArn,
+    validation: validationResult
+  };
+};
+
+async function pickLatestRecoveryPoint(): Promise<string> {
+  const cmd = new ListRecoveryPointsByBackupVaultCommand({ BackupVaultName: BACKUP_VAULT_NAME, ByResourceType: RESOURCE_TYPE, MaxResults: 20 });
+  const resp = await backup.send(cmd);
+  if (!resp.RecoveryPoints || resp.RecoveryPoints.length === 0) {
+    throw new Error("No recovery points found in vault");
+  }
+  // Sort by CreationDate descending
+  const sorted = [...resp.RecoveryPoints].sort((a, b) => (b.CreationDate?.getTime() || 0) - (a.CreationDate?.getTime() || 0));
+  return sorted[0].RecoveryPointArn!;
+}
+
+async function startRestore(recoveryPointArn: string): Promise<string> {
+  // For S3, pass the destination bucket as restore metadata. Other resource types will need their own Metadata keys.
+  const cmd = new StartRestoreJobCommand({
+    RecoveryPointArn: recoveryPointArn,
+    IamRoleArn: process.env.RESTORE_ROLE_ARN,
+    ResourceType: RESOURCE_TYPE,
+    Metadata: TARGET_BUCKET ? { destinationBucketName: TARGET_BUCKET } : {}
+  });
+  const resp = await backup.send(cmd);
+  if (!resp.RestoreJobId) throw new Error("StartRestoreJob returned no RestoreJobId");
+  return resp.RestoreJobId;
+}
+
+async function waitForCompletion(restoreJobId: string) {
+  const timeoutMs = 1000 * 60 * 13; // 13 minutes: below the 900 s Lambda timeout, so the error below can actually fire
+  const start = Date.now();
+  while (Date.now() - start < timeoutMs) {
+    const desc = await backup.send(new DescribeRestoreJobCommand({ RestoreJobId: restoreJobId }));
+    if (desc.Status === "COMPLETED" || desc.Status === "ABORTED" || desc.Status === "FAILED") {
+      return desc;
+    }
+    await new Promise(r => setTimeout(r, 15000));
+  }
+  throw new Error("Restore job did not finish within timeout");
+}
+
+async function invokeValidator(payload: any): Promise<ValidatorResult> {
+  const cmd = new InvokeCommand({
+    FunctionName: VALIDATOR_LAMBDA,
+    InvocationType: "RequestResponse",
+    Payload: Buffer.from(JSON.stringify(payload))
+  });
+  const resp = await lambda.send(cmd);
+  if (!resp.Payload) throw new Error("Validator returned no payload");
+  const txt = Buffer.from(resp.Payload).toString("utf-8");
+  try {
+    return JSON.parse(txt);
+  } catch (e) {
+    throw new Error("Validator payload JSON parse error: " + txt);
+  }
+}
+
+async function publishValidation(restoreJobId: string, result: ValidatorResult) {
+  const status = mapStatus(result.status);
+  const message = (result.message || "").slice(0, 1000);
+  const cmd = new PutRestoreValidationResultCommand({
+    RestoreJobId: restoreJobId,
+    ValidationStatus: status,
+    ValidationStatusMessage: message
+  });
+  await backup.send(cmd);
+}
+
+function mapStatus(s?: string): string {
+  if (!s) return "FAILED";
+  const upper = s.toUpperCase();
+  if (["SUCCESS", "SUCCESSFUL", "OK"].includes(upper)) return "SUCCESSFUL";
+  if (["FAILED", "FAIL", "ERROR"].includes(upper)) return "FAILED";
+  if (["SKIPPED", "IGNORE", "IGNORED"].includes(upper)) return "SKIPPED";
+  return "FAILED";
+}
diff --git a/modules/aws-backup-manual-validation/tsconfig.json b/modules/aws-backup-manual-validation/tsconfig.json
new file mode 100644
index 0000000..01dcc7f
--- /dev/null
+++ b/modules/aws-backup-manual-validation/tsconfig.json
@@ -0,0 +1,16 @@
+{
+  "compilerOptions": {
+    "target": "ES2020",
+    "module": "ES2020",
+    "moduleResolution": "Node",
+    "outDir": "dist",
+    "rootDir": "src",
+    "esModuleInterop": true,
+    "forceConsistentCasingInFileNames": true,
+    "strict": true,
+    "skipLibCheck": true,
+    "resolveJsonModule": true
+  },
+  "include": ["src/**/*.ts"],
+  "exclude": ["node_modules"]
+}
diff --git a/modules/aws-backup-manual-validation/variables.tf b/modules/aws-backup-manual-validation/variables.tf
new file mode 100644
index 0000000..4a18a79
--- /dev/null
+++ b/modules/aws-backup-manual-validation/variables.tf
@@ -0,0 +1,43 @@
+variable "enable" {
+  type        = bool
+  default     = true
+  description = "Whether to create manual validation orchestration resources."
+}
+
+variable "name_prefix" {
+  type        = string
+  description = "Prefix used for naming resources (e.g. project-env)."
+}
+
+variable "backup_vault_name" {
+  type        = string
+  description = "Name of the backup vault containing recovery points to restore for manual tests."
+}
+
+variable "restore_role_arn" {
+  type        = string
+  description = "IAM role ARN used by the restore job if a specific role is required (optional)."
+  default     = null
+}
+
+variable "validation_lambda_arn" {
+  type        = string
+  description = "Customer-provided Lambda ARN that performs validation after manual restore completes."
+}
+
+variable "resource_type" {
+  type        = string
+  description = "AWS Backup resource type for manual restore (e.g. S3, DynamoDB, RDS)."
+}
+
+variable "target_bucket_name" {
+  type        = string
+  description = "For S3 restores: name of the destination S3 bucket that the restore will produce or populate. Used only in the example orchestrator logic."
+  default     = null
+}
+
+variable "tags" {
+  type        = map(string)
+  default     = {}
+  description = "Tags to apply to created resources."
+}
diff --git a/modules/aws-backup-manual-validation/versions.tf b/modules/aws-backup-manual-validation/versions.tf
new file mode 100644
index 0000000..7f163ea
--- /dev/null
+++ b/modules/aws-backup-manual-validation/versions.tf
@@ -0,0 +1,9 @@
+terraform {
+  required_version = ">= 1.5.0"
+  required_providers {
+    aws = {
+      source  = "hashicorp/aws"
+      version = ">= 5.0"
+    }
+  }
+}

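Note: the validator contract described in `modules/aws-backup-manual-validation/README.md` only fixes the return shape (`status` plus `message`), so the pass/fail decision can be kept as a pure function that is easy to unit test. A minimal sketch follows, assuming an S3 presence check; `evaluateRestore`, `expectedKeys`, and `restoredKeys` are hypothetical names used for illustration and do not appear in the patch.

```typescript
// Hypothetical sketch of the decision logic inside a customer validator.
// Only the { status, message } return shape comes from the README contract.
type ValidationStatus = "SUCCESSFUL" | "FAILED" | "SKIPPED";

interface ValidatorResult {
  status: ValidationStatus;
  message: string;
}

// expectedKeys: keys the restore must contain (assumed to come from the validator's own config).
// restoredKeys: keys actually found in the restored bucket.
function evaluateRestore(expectedKeys: string[], restoredKeys: string[]): ValidatorResult {
  if (expectedKeys.length === 0) {
    // Nothing to check, so report SKIPPED rather than a misleading success.
    return { status: "SKIPPED", message: "No expected keys configured" };
  }
  const restored = new Set(restoredKeys);
  const missing = expectedKeys.filter((key) => !restored.has(key));
  if (missing.length > 0) {
    // Report only the first few missing keys to keep the message short.
    const sample = missing.slice(0, 5).join(", ");
    return {
      status: "FAILED",
      message: `${missing.length} of ${expectedKeys.length} expected keys missing: ${sample}`
    };
  }
  return { status: "SUCCESSFUL", message: `All ${expectedKeys.length} expected keys present` };
}
```

In a real validator the handler would gather `restoredKeys` from the restored bucket (for example by listing a prefix) and return the result of `evaluateRestore` unchanged, leaving status normalisation to the orchestrator.
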
From c81b0fc8e4ab0c7654bfe8c554a1948431bd018d Mon Sep 17 00:00:00 2001
From: Nick Miles <nick.miles5@nhs.net>
Date: Sat, 20 Sep 2025 01:19:58 +0100
Subject: [PATCH 2/2] ENG-893 Remove accidental inclusion

---
 docs/restore-testing-design.md | 312 ---------------------------------
 1 file changed, 312 deletions(-)
 delete mode 100644 docs/restore-testing-design.md

diff --git a/docs/restore-testing-design.md b/docs/restore-testing-design.md
deleted file mode 100644
index 50b666a..0000000
--- a/docs/restore-testing-design.md
+++ /dev/null
@@ -1,312 +0,0 @@
-# AWS Backup Restore Testing Validation & Integrity Design
-
-## 1. Objectives
-
-Provide a blueprint extension that not only provisions AWS Backup Restore Testing Plans (already partially implemented via `awscc_backup_restore_testing_plan` and selections) but also validates that restored resources are *functional* and *internally consistent*. Users (blueprint implementers) define integrity checks per resource type (e.g. SQL query for RDS/Aurora, manifest verification for S3, item checks for DynamoDB) executed automatically after AWS Backup restore tests complete.
-
-## 2. High-Level Architecture
-
-![end-to-end visual of the event-driven validation workflow](diagrams/restore-validation-sequence.png)
-
-```text
-AWS Backup Restore Testing Plan (scheduled)
-        │ (runs restore jobs)
-        ▼
-Restore Test Jobs (Test restore of latest/random recovery points)
-        │ emit EventBridge events (Restore Job State Change: COMPLETED)
-        ▼
-EventBridge Rule (filters status=COMPLETED + restoreTestingPlanArn)
-        │
-        ▼
-Step Functions State Machine (or direct Lambda)  <── optional batching fan‑in
-  1. Fetch restore job details
-  2. Dispatch per resource-type validator (Lambda / Fargate / custom)
-  3. Execute user-defined integrity logic (SQL / API / S3 diff etc.)
-  4. Aggregate results
-  5. Call PutRestoreValidationResult (per restore job)
-  6. Emit metrics + SNS / EventBridge notifications
-        │
-        ▼
-CloudWatch Metrics / Logs / Alarms + Backup Console Validation Status
-```
-
-### Why Step Functions?
-
-- Orchestrates retries, parallel fan-out per restored resource
-- Standardises timeout + backoff policies
-- Simplifies conditional branching for resource types
-- Enables centralised audit trail for validation workflow
-
-A simpler single Lambda path remains possible for minimal setups; design supports either.
-
-> For an ad-hoc, customer‑supplied validator workflow (manual restore + external Lambda validation without Step Functions), see `manual-restore-validation.md`.
-
-## 3. Data & Control Flows
-
-| Flow | Source → Target | Notes |
-|------|-----------------|-------|
-| A | AWS Backup → EventBridge | "Restore Job State Change" event, includes `restoreJobId`, `resourceType`, `createdResourceArn`, `restoreTestingPlanArn` |
-| B | EventBridge → Step Functions | Input filtered by plan ARN / resource types |
-| C | Step Functions → AWS Backup API | `DescribeRestoreJob` for enrichment |
-| D | Step Functions → Validator Lambdas | One per resource type OR generic dispatcher |
-| E | Validators → Target resource | Run integrity checks (SQL, scan, HEAD, etc.) |
-| F | Validators → AWS Backup | `PutRestoreValidationResult(ValidationStatus=SUCCESSFUL\|FAILED\|SKIPPED)` |
-| G | Step Functions → CloudWatch / SNS | Emit metrics, structured JSON log, optional alert |
-
-## 4. State Machine Definition (Express or Standard)
-
-Recommended: **Standard** (because restores may take hours; we only start after COMPLETED but validation might be longer running for large datasets). Express acceptable if you guarantee short validations.
-
-Proposed states (Amazon States Language pseudo):
-
-```json
-{
-  "Comment": "Restore Test Validation Orchestrator",
-  "StartAt": "Init",
-
-  "States": {
-    "Init": { "Type": "Pass", "ResultPath": "$.context", "Next": "EnrichRestoreJob" },
-    "EnrichRestoreJob": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:describeRestoreJob", "Parameters": { "RestoreJobId": "$.detail.restoreJobId" }, "ResultPath": "$.restoreJob", "Next": "RouteByResourceType" },
-    "RouteByResourceType": { "Type": "Choice", "Choices": [
-        { "Variable": "$.detail.resourceType", "StringEquals": "Aurora", "Next": "AuroraValidation" },
-        { "Variable": "$.detail.resourceType", "StringEquals": "RDS", "Next": "RDSValidation" },
-        { "Variable": "$.detail.resourceType", "StringEquals": "DynamoDB", "Next": "DynamoValidation" },
-        { "Variable": "$.detail.resourceType", "StringEquals": "S3", "Next": "S3Validation" }
-      ], "Default": "GenericSkip" },
-    "AuroraValidation": { "Type": "Task", "Resource": "${lambda_arn_aurora}" , "ResultPath": "$.validation", "Next": "PublishResult" },
-    "RDSValidation": { "Type": "Task", "Resource": "${lambda_arn_rds}" , "ResultPath": "$.validation", "Next": "PublishResult" },
-    "DynamoValidation": { "Type": "Task", "Resource": "${lambda_arn_dynamo}" , "ResultPath": "$.validation", "Next": "PublishResult" },
-    "S3Validation": { "Type": "Task", "Resource": "${lambda_arn_s3}" , "ResultPath": "$.validation", "Next": "PublishResult" },
-    "GenericSkip": { "Type": "Pass", "Result": { "status": "SKIPPED", "message": "No validator implemented for resourceType" }, "ResultPath": "$.validation", "Next": "PublishResult" },
-    "PublishResult": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:putRestoreValidationResult", "Parameters": { "RestoreJobId": "$.detail.restoreJobId", "ValidationStatus": "$.validation.status", "ValidationStatusMessage": "$.validation.message" }, "Next": "EmitMetrics" },
-    "EmitMetrics": { "Type": "Task", "Resource": "${lambda_arn_metrics}", "End": true }
-  }
-}
-```
-
-Notes:
-
-- `${lambda_arn_*}` produced conditionally via Terraform based on enabled validators.
-- Timeout & retry policies applied per Task (e.g. RDS 5 min, S3 2 min, Dynamo 1 min) with `Retry` blocks.
-- Could collapse validators into one generic Lambda with plugin pattern.
-
-## 5. Extensibility Interface
-
-Users supply validation definitions via Terraform variables consumed by validator Lambda(s).
-
-### 5.1 Terraform Variables (additions)
-
-```hcl
-variable "restore_validation_config" {
-  description = "Map keyed by resource type containing validation directives."
-  type = object({
-    rds = optional(object({
-      enabled          = bool
-      cluster_identifiers = optional(list(string))
-      sql_checks = list(object({
-        database = string
-        statement = string
-        expected_rows = optional(number)
-        expected_hash = optional(string) # SHA256 of concatenated row values
-        timeout_seconds = optional(number)
-      }))
-      secret_arn = string # AWS Secrets Manager ARN for master creds or read-only
-    }))
-    dynamodb = optional(object({
-      enabled = bool
-      tables  = list(string)
-      checks = list(object({
-        table        = string
-        expected_item_count = optional(number)
-        key_sample = optional(list(object({
-          pk = string
-          sk = optional(string)
-          expected_item_hash = optional(string)
-        })))
-      }))
-    }))
-    s3 = optional(object({
-      enabled = bool
-      buckets = list(object({
-        name = string
-        manifest_s3_uri = optional(string) # points to authoritative manifest
-        sample_prefixes = optional(list(string))
-        compare_object_tags = optional(bool)
-      }))
-    }))
-    aurora = optional(object({
-      enabled   = bool
-      clusters  = list(string)
-      sql_checks = list(object({
-        cluster_endpoint = optional(string)
-        database  = string
-        statement = string
-        expected_rows = optional(number)
-      }))
-      secret_arn = string
-    }))
-  })
-  default = {}
-}
-```
-
-
-### 5.2 Lambda Validator Contract
-
-All validator handlers accept unified event schema:
-
-```json
-{
-  "restoreJobId": "string",
-  "resourceType": "RDS|Aurora|DynamoDB|S3|...",
-  "createdResourceArn": "arn:aws:...",
-  "config": { "...resource specific config subset..." }
-}
-```
-Return object:
-
-
-```json
-{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "Human readable" }
-```
-
-
-### 5.3 Packaging Strategy
-
-- Single Lambda with language (Python/Node) loads `config` JSON from SSM Parameter or encrypted file in S3 (to avoid large env variables)
-- Pluggable validators registered in a dict keyed by resource type
-- Optional user-provided Lambda ARN override per resource type for complete custom logic
-
-### 5.4 Validation Logic Patterns
-
-| Resource | Strategy | Failure Conditions |
-|----------|----------|-------------------|
-| RDS/Aurora | Execute SQL checks (each inside txn, read-only) | Query error, row count mismatch, hash mismatch, timeout |
-| DynamoDB | DescribeTable + (optional) Scan limit or PartiQL key gets | Table missing, item count variance > threshold, sample hash mismatch |
-| S3 | HEAD sample objects, optional compare against manifest (object key + size + etag) | Missing objects, size/etag mismatch, manifest not accessible |
-| EBS (future) | (Optional) Attach test volume to temp instance and run FS metadata probe script | Attach failure, FS errors |
-
-## 6. Examples
-
-### 6.1 RDS Example Config
-
-```hcl
-restore_validation_config = {
-  rds = {
-    enabled = true
-    secret_arn = aws_secretsmanager_secret.rds_ro.arn
-    sql_checks = [
-      { database = "appdb", statement = "SELECT COUNT(*) c FROM customers", expected_rows = 1 },
-      { database = "appdb", statement = "SELECT sha256(string_agg(id || ':' || status, ',' ORDER BY id)) h FROM orders", expected_hash = "abc123..." }
-    ]
-  }
-}
-```
-
-### 6.2 DynamoDB Example Config
-
-```hcl
-restore_validation_config = {
-  dynamodb = {
-    enabled = true
-    tables = ["orders", "customers"]
-    checks = [
-      { table = "orders", expected_item_count = 15000 },
-      { table = "customers", key_sample = [ { pk = "CUST#123", expected_item_hash = "d41d8cd98f" } ] }
-    ]
-  }
-}
-```
-
-### 6.3 S3 Example Config
-
-```hcl
-restore_validation_config = {
-  s3 = {
-    enabled = true
-    buckets = [{
-      name = "images-bucket",
-      manifest_s3_uri = "s3://manifests-prod/images-bucket.manifest.json",
-      sample_prefixes = ["2025/09/", "2025/08/"]
-    }]
-  }
-}
-```
-
-## 7. Security & Compliance
-
-- IAM: Validators assume dedicated role with least-privilege policies (RDS: `rds-data:ExecuteStatement` / `secretsmanager:GetSecretValue`; DynamoDB: `DescribeTable`, `GetItem`, limited `Scan` with `Limit`; S3: `HeadObject`, `GetObject` for manifest)
-- Secrets: Use Secrets Manager for DB creds; do not log credentials or query data
-- KMS: Encrypt Lambda environment variables, S3 manifest bucket, and Secrets Manager secret
-- Network: For RDS/Aurora in private subnets, place Lambda in same VPC subnets with least required SG egress
-- Auditing: Structured JSON logs (include `restoreJobId`, `resourceType`, check identifiers)
-- PII Minimisation: Hash or count only; avoid selecting raw personal data rows
-- Integrity of config: Optionally sign config file (S3 object with checksum validation before use)
-
-## 8. Operational Considerations & Cost
-
-- Throttle: Concurrency controls via Step Functions + reserved concurrency on validator Lambda to avoid storm after bulk restores
-- Timeouts: Short per-check timeouts (e.g. 30s; fail fast pattern)
-- Retention Window: If deeper validation requires longer retention, expose `retain_hours_before_cleanup` variable (aligns with AWS restore testing retention concept)
-- Metrics: Emit CloudWatch custom metrics: `ValidationSuccess`, `ValidationFailure`, `ValidationDurationMs` with dimensions `ResourceType`, `PlanName`
-- Alerting: SNS topic for failures >0 in last run, or error rate > threshold across rolling period
-- Cost Levers: Limit number of SQL checks; use targeted `GetItem` vs full table scans; sample S3 objects (k=20 per prefix) unless manifest diff required
-
-## 9. Acceptance Criteria Mapping
-
-| Requirement | Design Element |
-|------------|----------------|
-| "Ability from the blueprint to run automated test to validate restoration" | EventBridge + Step Functions + validators triggered on restore completion |
-| "Test integrity of restored resource, specific to blueprint implementer" | `restore_validation_config` + per-resource plugin architecture |
-| "Define an SQL query for RDS to test integrity" | `sql_checks` array with expected rows/hash support |
-| "Customer responsible for defining and validating check" | User supplies Terraform variable config and (optionally) custom Lambda override |
-| "Step function would just allow this functionality" | State machine orchestrates and records results via `PutRestoreValidationResult` |
-
-## 10. Future Enhancements
-
-- Add cross-account validation (restore to isolated test account, assume role back)
-- Support FSx / EFS mount probing using Fargate task
-- Provide Terraform module subfolder `validation` generating Step Functions + default validator Lambda
-- Add canned dashboards (CloudWatch) for validation pass rate & duration
-
-## 11. Terraform Module Additions (Summary)
-
-Minimal initial scope:
-
-1. New optional module `aws-backup-validation` OR integrated into `aws-backup-source` behind feature flag `enable_restore_validation`
-2. Resources:
-   - EventBridge rule
-   - Step Functions state machine (JSON from templatefile)
-   - IAM roles/policies (state machine + lambda)
-   - Validator Lambda (zip from local build or external source)
-   - SSM Parameter / S3 object for config JSON
-3. Variables: `enable_restore_validation`, `restore_validation_config`, `custom_validator_lambda_arns` (map)
-4. Outputs: `restore_validation_state_machine_arn`, `restore_validation_config_parameter_arn`
-
-Current prototype implementation lives in `modules/aws-backup-validation` and provides a minimal Lambda + Step Functions + EventBridge rule path. Future iterations should harden IAM scoping and expand validator logic prior to production adoption.
-
-## 12. Example User Flow
-
-1. Enable restore testing (already done with existing plan resources)
-2. Set `enable_restore_validation = true`
-3. Provide `restore_validation_config` with at least one resource type
-4. Apply Terraform – deploys validation infra
-5. Wait for scheduled restore test; Step Functions records validation results
-6. View status in AWS Backup Console / CloudWatch dashboard
-
-## 13. Risks & Mitigations
-
-| Risk | Mitigation |
-|------|------------|
-| Long-running SQL leads to Lambda timeout | Enforce per-query timeout + limit operations (SELECT only) |
-| Validator failure blocks result publishing | Wrap each validator in try/catch; on unhandled exception mark FAILED with reason |
-| Sensitive data leakage in logs | Scrub query parameters and row data; log only counts + hashes |
-| Drift between Terraform config and live validator config | Version config (include checksum) and log version per run |
-| Excess costs from scanning large DynamoDB tables | Use item count from `DescribeTable` and targeted sample keys, avoid full scans |
-
-## 14. Open Questions
-
-- Provide managed library of validation query templates? (Out of initial scope)
-- Should retention hours be explicitly configurable per selection via Terraform? (Potential future variable)
-- Add option for concurrency-limited validation queue (SQS + Lambda) instead of Step Functions? (Future scale consideration)
-