diff --git a/docs/diagrams/restore-validation-sequence.png b/docs/diagrams/restore-validation-sequence.png new file mode 100644 index 0000000..938d417 Binary files /dev/null and b/docs/diagrams/restore-validation-sequence.png differ diff --git a/docs/diagrams/restore-validation-sequence.puml b/docs/diagrams/restore-validation-sequence.puml new file mode 100644 index 0000000..2ecdf38 --- /dev/null +++ b/docs/diagrams/restore-validation-sequence.puml @@ -0,0 +1,110 @@ +@startuml restore-validation-sequence +' Title & Legend +!theme plain +skinparam ParticipantPadding 8 +skinparam BoxPadding 6 +skinparam Shadowing false +skinparam ArrowThickness 1 +skinparam ArrowColor #2d5d86 +skinparam ActorStyle awesome +skinparam SequenceMessageAlign center +skinparam BackgroundColor #ffffff + +title AWS Backup Restore Testing & Validation Flow + +legend left + This diagram illustrates the post-restore validation workflow: + 1. Scheduled restore tests run via an AWS Backup Restore Testing Plan. + 2. When a restore job COMPLETES, an EventBridge rule targets a Step Functions + state machine that orchestrates validation. + 3. Lambda validator loads per-resource validation config from SSM Parameter Store + and executes resource‑type specific checks (e.g. RDS SQL assertions, + DynamoDB item sampling, S3 manifest / object probes). + 4. Validation result is published back to AWS Backup using PutRestoreValidationResult. 
+endlegend + +actor User as U +participant "AWS Backup Restore\nTesting Plan" as Plan +participant "AWS Backup\n(Service)" as Backup +participant "Restore Job" as Restore +participant "EventBridge Rule" as EB +participant "Step Functions\n(State Machine)" as SFN +participant "State: Enrich" as Enrich +participant "State: Route" as Route +participant "Lambda Validator" as Lambda +participant "SSM Parameter\n(Store Config)" as SSM +participant "Resource APIs\n(RDS | DynamoDB | S3 | etc.)" as APIs +participant "AWS Backup API\n(PutRestoreValidationResult)" as ResultAPI +participant "CloudWatch Logs" as Logs + +' 1. Scheduled restore initiated +U -> Plan : (Schedule configured) +Plan -> Backup : Initiate restore test jobs (per selection) +Backup -> Restore ++ : Create restore job(s) + +' 2. Restore completes +Restore -> Backup : Status = COMPLETED (success) +Backup -> EB : Event: Restore Job State Change\n(detail.status = COMPLETED) + +' 3. EventBridge triggers Step Functions +EB -> SFN : StartExecution (input = restore job event) +activate SFN +SFN -> Enrich : Pass original event / add metadata +activate Enrich +Enrich --> SFN : Enriched context +deactivate Enrich + +SFN -> Route : Determine resourceType +activate Route + +alt Supported resource type + Route -> Lambda : Invoke validator (payload: job + configRef) + activate Lambda + Lambda -> SSM : Get config parameter + SSM --> Lambda : JSON config + Lambda -> APIs : Perform type-specific checks + APIs --> Lambda : Check results / metrics + Lambda -> Logs : Structured validation logs + Lambda --> SFN : { status: SUCCESSFUL | FAILED, details } + deactivate Lambda +else Unsupported / disabled type + Route --> SFN : { status: SKIPPED, reason } +end + +deactivate Route + +' 4. 
Publish result back to AWS Backup +SFN -> ResultAPI : PutRestoreValidationResult\n(status, message, resourceType, metadata) +ResultAPI --> SFN : 200 OK + +SFN -> Logs : State machine execution log (success path) +SFN --> EB : (Implicit: EventBridge metrics / tracing) +SFN --> U : (Optional surfacing via reporting / notifications) + +SFN --> Backup : (Validation outcome associated to restore job) +deactivate SFN + +== Failure Handling == + +group Validator Error Path + Lambda -> Logs : Error + stack trace + Lambda --> SFN : { status: FAILED, errorMessage } + SFN -> ResultAPI : PutRestoreValidationResult (FAILED) + ResultAPI --> SFN : 200 OK +end + +== Notes == +note over Lambda,APIs + Validation logic pluggable per resource type. + Future extensions: metrics, alarms, custom plugins. +end note + +note over SFN + States (conceptual): + 1. EnrichRestoreJob + 2. RouteByResourceType + 3. InvokeValidator (task) OR SkipUnsupported (pass) + 4. PublishResult +end note + +@enduml diff --git a/docs/restore-testing-design.md b/docs/restore-testing-design.md new file mode 100644 index 0000000..cd560ba --- /dev/null +++ b/docs/restore-testing-design.md @@ -0,0 +1,310 @@ +# AWS Backup Restore Testing Validation & Integrity Design + +## 1. Objectives + +Provide a blueprint extension that not only provisions AWS Backup Restore Testing Plans (already partially implemented via `awscc_backup_restore_testing_plan` and selections) but also validates that restored resources are *functional* and *internally consistent*. Users (blueprint implementers) define integrity checks per resource type (e.g. SQL query for RDS/Aurora, manifest verification for S3, item checks for DynamoDB) executed automatically after AWS Backup restore tests complete. + +## 2. 
High-Level Architecture + +![end-to-end visual of the event-driven validation workflow](diagrams/restore-validation-sequence.png) + +```text +AWS Backup Restore Testing Plan (scheduled) + │ (runs restore jobs) + ▼ +Restore Test Jobs (Test restore of latest/random recovery points) + │ emit EventBridge events (Restore Job State Change: COMPLETED) + ▼ +EventBridge Rule (filters status=COMPLETED + restoreTestingPlanArn) + │ + ▼ +Step Functions State Machine (or direct Lambda) <── optional batching fan‑in + 1. Fetch restore job details + 2. Dispatch per resource-type validator (Lambda / Fargate / custom) + 3. Execute user-defined integrity logic (SQL / API / S3 diff etc.) + 4. Aggregate results + 5. Call PutRestoreValidationResult (per restore job) + 6. Emit metrics + SNS / EventBridge notifications + │ + ▼ +CloudWatch Metrics / Logs / Alarms + Backup Console Validation Status +``` + +### Why Step Functions? + +- Orchestrates retries, parallel fan-out per restored resource +- Standardises timeout + backoff policies +- Simplifies conditional branching for resource types +- Enables centralised audit trail for validation workflow + +A simpler single Lambda path remains possible for minimal setups; design supports either. + +## 3. Data & Control Flows + +| Flow | Source → Target | Notes | +|------|-----------------|-------| +| A | AWS Backup → EventBridge | "Restore Job State Change" event, includes `restoreJobId`, `resourceType`, `createdResourceArn`, `restoreTestingPlanArn` | +| B | EventBridge → Step Functions | Input filtered by plan ARN / resource types | +| C | Step Functions → AWS Backup API | `DescribeRestoreJob` for enrichment | +| D | Step Functions → Validator Lambdas | One per resource type OR generic dispatcher | +| E | Validators → Target resource | Run integrity checks (SQL, scan, HEAD, etc.) 
| +| F | Validators → AWS Backup | `PutRestoreValidationResult(ValidationStatus=SUCCESSFUL\|FAILED\|SKIPPED)` | +| G | Step Functions → CloudWatch / SNS | Emit metrics, structured JSON log, optional alert | + +## 4. State Machine Definition (Express or Standard) + +Recommended: **Standard**. Execution starts only after the restore job reports COMPLETED, but validating a large dataset can still run long, and Express workflows are capped at five minutes. Express is acceptable only if every validation is guaranteed to finish well within that cap. + +Proposed states (Amazon States Language pseudo): + +```json +{ + "Comment": "Restore Test Validation Orchestrator", + "StartAt": "Init", + + "States": { + "Init": { "Type": "Pass", "ResultPath": "$.context", "Next": "EnrichRestoreJob" }, + "EnrichRestoreJob": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:describeRestoreJob", "Parameters": { "RestoreJobId.$": "$.detail.restoreJobId" }, "ResultPath": "$.restoreJob", "Next": "RouteByResourceType" }, + "RouteByResourceType": { "Type": "Choice", "Choices": [ + { "Variable": "$.detail.resourceType", "StringEquals": "Aurora", "Next": "AuroraValidation" }, + { "Variable": "$.detail.resourceType", "StringEquals": "RDS", "Next": "RDSValidation" }, + { "Variable": "$.detail.resourceType", "StringEquals": "DynamoDB", "Next": "DynamoValidation" }, + { "Variable": "$.detail.resourceType", "StringEquals": "S3", "Next": "S3Validation" } + ], "Default": "GenericSkip" }, + "AuroraValidation": { "Type": "Task", "Resource": "${lambda_arn_aurora}" , "ResultPath": "$.validation", "Next": "PublishResult" }, + "RDSValidation": { "Type": "Task", "Resource": "${lambda_arn_rds}" , "ResultPath": "$.validation", "Next": "PublishResult" }, + "DynamoValidation": { "Type": "Task", "Resource": "${lambda_arn_dynamo}" , "ResultPath": "$.validation", "Next": "PublishResult" }, + "S3Validation": { "Type": "Task", "Resource": "${lambda_arn_s3}" , "ResultPath": "$.validation", "Next": "PublishResult" }, + "GenericSkip": { "Type": "Pass", "Result": { "status": "SKIPPED",
"message": "No validator implemented for resourceType" }, "ResultPath": "$.validation", "Next": "PublishResult" }, + "PublishResult": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:backup:putRestoreValidationResult", "Parameters": { "RestoreJobId.$": "$.detail.restoreJobId", "ValidationStatus.$": "$.validation.status", "ValidationStatusMessage.$": "$.validation.message" }, "ResultPath": null, "Next": "EmitMetrics" }, + "EmitMetrics": { "Type": "Task", "Resource": "${lambda_arn_metrics}", "End": true } + } +} +``` + +Notes: + +- `${lambda_arn_*}` produced conditionally via Terraform based on enabled validators. +- Parameter keys that read from the state input carry the `.$` suffix; without it the JSONPath is passed as a literal string. +- Timeout & retry policies applied per Task (e.g. RDS 5 min, S3 2 min, Dynamo 1 min) with `Retry` blocks. +- Could collapse validators into one generic Lambda with plugin pattern. + +## 5. Extensibility Interface + +Users supply validation definitions via Terraform variables consumed by validator Lambda(s). + +### 5.1 Terraform Variables (additions) + +```hcl +variable "restore_validation_config" { + description = "Map keyed by resource type containing validation directives."
+ type = object({ + rds = optional(object({ + enabled = bool + cluster_identifiers = optional(list(string)) + sql_checks = list(object({ + database = string + statement = string + expected_rows = optional(number) + expected_hash = optional(string) # SHA256 of concatenated row values + timeout_seconds = optional(number) + })) + secret_arn = string # AWS Secrets Manager ARN for master creds or read-only + })) + dynamodb = optional(object({ + enabled = bool + tables = list(string) + checks = list(object({ + table = string + expected_item_count = optional(number) + key_sample = optional(list(object({ + pk = string + sk = optional(string) + expected_item_hash = optional(string) + }))) + })) + })) + s3 = optional(object({ + enabled = bool + buckets = list(object({ + name = string + manifest_s3_uri = optional(string) # points to authoritative manifest + sample_prefixes = optional(list(string)) + compare_object_tags = optional(bool) + })) + })) + aurora = optional(object({ + enabled = bool + clusters = list(string) + sql_checks = list(object({ + cluster_endpoint = optional(string) + database = string + statement = string + expected_rows = optional(number) + })) + secret_arn = string + })) + }) + default = {} +} +``` + + +### 5.2 Lambda Validator Contract + +All validator handlers accept unified event schema: + +```json +{ + "restoreJobId": "string", + "resourceType": "RDS|Aurora|DynamoDB|S3|...", + "createdResourceArn": "arn:aws:...", + "config": { "...resource specific config subset..." 
} +} +``` +Return object: + + +```json +{ "status": "SUCCESSFUL|FAILED|SKIPPED", "message": "Human readable" } +``` + + +### 5.3 Packaging Strategy + +- Single Lambda with language (Python/Node) loads `config` JSON from SSM Parameter or encrypted file in S3 (to avoid large env variables) +- Pluggable validators registered in a dict keyed by resource type +- Optional user-provided Lambda ARN override per resource type for complete custom logic + +### 5.4 Validation Logic Patterns + +| Resource | Strategy | Failure Conditions | +|----------|----------|-------------------| +| RDS/Aurora | Execute SQL checks (each inside txn, read-only) | Query error, row count mismatch, hash mismatch, timeout | +| DynamoDB | DescribeTable + (optional) Scan limit or PartiQL key gets | Table missing, item count variance > threshold, sample hash mismatch | +| S3 | HEAD sample objects, optional compare against manifest (object key + size + etag) | Missing objects, size/etag mismatch, manifest not accessible | +| EBS (future) | (Optional) Attach test volume to temp instance and run FS metadata probe script | Attach failure, FS errors | + +## 6. Examples + +### 6.1 RDS Example Config + +```hcl +restore_validation_config = { + rds = { + enabled = true + secret_arn = aws_secretsmanager_secret.rds_ro.arn + sql_checks = [ + { database = "appdb", statement = "SELECT COUNT(*) c FROM customers", expected_rows = 1 }, + { database = "appdb", statement = "SELECT sha256(string_agg(id || ':' || status, ',' ORDER BY id)) h FROM orders", expected_hash = "abc123..." 
} + ] + } +} +``` + +### 6.2 DynamoDB Example Config + +```hcl +restore_validation_config = { + dynamodb = { + enabled = true + tables = ["orders", "customers"] + checks = [ + { table = "orders", expected_item_count = 15000 }, + { table = "customers", key_sample = [ { pk = "CUST#123", expected_item_hash = "d41d8cd98f" } ] } + ] + } +} +``` + +### 6.3 S3 Example Config + +```hcl +restore_validation_config = { + s3 = { + enabled = true + buckets = [{ + name = "images-bucket", + manifest_s3_uri = "s3://manifests-prod/images-bucket.manifest.json", + sample_prefixes = ["2025/09/", "2025/08/"] + }] + } +} +``` + +## 7. Security & Compliance + +- IAM: Validators assume dedicated role with least-privilege policies (RDS: `rds-data:ExecuteStatement` / `secretsmanager:GetSecretValue`; DynamoDB: `DescribeTable`, `GetItem`, limited `Scan` with `Limit`; S3: `HeadObject`, `GetObject` for manifest) +- Secrets: Use Secrets Manager for DB creds; do not log credentials or query data +- KMS: Encrypt Lambda environment variables, S3 manifest bucket, and Secrets Manager secret +- Network: For RDS/Aurora in private subnets, place Lambda in same VPC subnets with least required SG egress +- Auditing: Structured JSON logs (include `restoreJobId`, `resourceType`, check identifiers) +- PII Minimisation: Hash or count only; avoid selecting raw personal data rows +- Integrity of config: Optionally sign config file (S3 object with checksum validation before use) + +## 8. Operational Considerations & Cost + +- Throttle: Concurrency controls via Step Functions + reserved concurrency on validator Lambda to avoid storm after bulk restores +- Timeouts: Short per-check timeouts (e.g. 
30s; fail fast pattern) +- Retention Window: If deeper validation requires longer retention, expose `retain_hours_before_cleanup` variable (aligns with AWS restore testing retention concept) +- Metrics: Emit CloudWatch custom metrics: `ValidationSuccess`, `ValidationFailure`, `ValidationDurationMs` with dimensions `ResourceType`, `PlanName` +- Alerting: SNS topic for failures >0 in last run, or error rate > threshold across rolling period +- Cost Levers: Limit number of SQL checks; use targeted `GetItem` vs full table scans; sample S3 objects (k=20 per prefix) unless manifest diff required + +## 9. Acceptance Criteria Mapping + +| Requirement | Design Element | +|------------|----------------| +| "Ability from the blueprint to run automated test to validate restoration" | EventBridge + Step Functions + validators triggered on restore completion | +| "Test integrity of restored resource, specific to blueprint implementer" | `restore_validation_config` + per-resource plugin architecture | +| "Define an SQL query for RDS to test integrity" | `sql_checks` array with expected rows/hash support | +| "Customer responsible for defining and validating check" | User supplies Terraform variable config and (optionally) custom Lambda override | +| "Step function would just allow this functionality" | State machine orchestrates and records results via `PutRestoreValidationResult` | + +## 10. Future Enhancements + +- Add cross-account validation (restore to isolated test account, assume role back) +- Support FSx / EFS mount probing using Fargate task +- Provide Terraform module subfolder `validation` generating Step Functions + default validator Lambda +- Add canned dashboards (CloudWatch) for validation pass rate & duration + +## 11. Terraform Module Additions (Summary) + +Minimal initial scope: + +1. New optional module `aws-backup-validation` OR integrated into `aws-backup-source` behind feature flag `enable_restore_validation` +2. 
Resources: + - EventBridge rule + - Step Functions state machine (JSON from templatefile) + - IAM roles/policies (state machine + lambda) + - Validator Lambda (zip from local build or external source) + - SSM Parameter / S3 object for config JSON +3. Variables: `enable_restore_validation`, `restore_validation_config`, `custom_validator_lambda_arns` (map) +4. Outputs: `restore_validation_state_machine_arn`, `restore_validation_config_parameter_arn` + +Current prototype implementation lives in `modules/aws-backup-validation` and provides a minimal Lambda + Step Functions + EventBridge rule path. Future iterations should harden IAM scoping and expand validator logic prior to production adoption. + +## 12. Example User Flow + +1. Enable restore testing (already done with existing plan resources) +2. Set `enable_restore_validation = true` +3. Provide `restore_validation_config` with at least one resource type +4. Apply Terraform – deploys validation infra +5. Wait for scheduled restore test; Step Functions records validation results +6. View status in AWS Backup Console / CloudWatch dashboard + +## 13. Risks & Mitigations + +| Risk | Mitigation | +|------|------------| +| Long-running SQL leads to Lambda timeout | Enforce per-query timeout + limit operations (SELECT only) | +| Validator failure blocks result publishing | Wrap each validator in try/catch; on unhandled exception mark FAILED with reason | +| Sensitive data leakage in logs | Scrub query parameters and row data; log only counts + hashes | +| Drift between Terraform config and live validator config | Version config (include checksum) and log version per run | +| Excess costs from scanning large DynamoDB tables | Use item count from `DescribeTable` and targeted sample keys, avoid full scans | + +## 14. Open Questions + +- Provide managed library of validation query templates? (Out of initial scope) +- Should retention hours be explicitly configurable per selection via Terraform? 
(Potential future variable) +- Add option for concurrency-limited validation queue (SQS + Lambda) instead of Step Functions? (Future scale consideration) + diff --git a/modules/aws-backup-validation/README.md b/modules/aws-backup-validation/README.md new file mode 100644 index 0000000..4f4c835 --- /dev/null +++ b/modules/aws-backup-validation/README.md @@ -0,0 +1,41 @@ +# aws-backup-validation Module + +Prototype module that deploys infrastructure to validate AWS Backup Restore Testing jobs. + +## Components + +- Lambda validator (pluggable placeholder) reading config from SSM Parameter +- Step Functions state machine orchestrating describe + validator + publish result +- EventBridge rule triggering on restore job COMPLETED for a specific restore testing plan ARN +- IAM roles/policies (least-privilege baseline – refine for production) + +## Inputs + +Refer to `variables.tf` for full list. Key variables: + +- `restore_testing_plan_arn` (required) +- `validation_config_json` JSON document with resource-type validation definitions + +## Outputs + +- `state_machine_arn` +- `validator_lambda_arn` +- `config_parameter_name` + +## Example + +```hcl +module "backup_validation" { + source = "../modules/aws-backup-validation" + restore_testing_plan_arn = awscc_backup_restore_testing_plan.backup_restore_testing_plan.arn + validation_config_json = jsonencode({ + rds = { sql_checks = [{ database = "appdb", statement = "SELECT 1" }] } + }) +} +``` + +## Next Steps + +- Expand validator logic (RDS via rds-data, S3 manifest comparisons, DynamoDB samples) +- Add CloudWatch metrics & alarms +- Add optional custom Lambda override mapping per resource type diff --git a/modules/aws-backup-validation/iam.tf b/modules/aws-backup-validation/iam.tf new file mode 100644 index 0000000..3c11e66 --- /dev/null +++ b/modules/aws-backup-validation/iam.tf @@ -0,0 +1,112 @@ +locals { + validator_lambda_name = "${var.name_prefix}-validator" + state_machine_name = "${var.name_prefix}-state-machine" + 
ssm_param_name = "/${var.name_prefix}/config" +} + +resource "aws_iam_role" "validator_lambda" { + name = "${local.validator_lambda_name}-role" + assume_role_policy = data.aws_iam_policy_document.lambda_assume.json + tags = var.tags +} + +data "aws_iam_policy_document" "lambda_assume" { + statement { + effect = "Allow" + principals { + type = "Service" + identifiers = ["lambda.amazonaws.com"] + } + actions = ["sts:AssumeRole"] + } +} + +resource "aws_iam_role_policy" "validator_basic" { + name = "${local.validator_lambda_name}-basic" + role = aws_iam_role.validator_lambda.id + policy = data.aws_iam_policy_document.validator_policy.json +} + +data "aws_iam_policy_document" "validator_policy" { + statement { + sid = "Logs" + effect = "Allow" + actions = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"] + resources = ["arn:aws:logs:*:*:*" ] + } + statement { + sid = "DescribeRestoreJob" + effect = "Allow" + actions = ["backup:DescribeRestoreJob", "backup:PutRestoreValidationResult"] + resources = ["*"] + } + statement { + sid = "GetConfig" + effect = "Allow" + actions = ["ssm:GetParameter", "ssm:GetParameters", "ssm:GetParameterHistory"] + resources = ["arn:aws:ssm:*:*:parameter${local.ssm_param_name}"] + } + # Add minimal read for services (extend if needed by resource type validators) + statement { + sid = "RDSData" + effect = "Allow" + actions = ["rds-data:ExecuteStatement"] + resources = ["*"] + } + statement { + sid = "DynamoRead" + effect = "Allow" + actions = ["dynamodb:DescribeTable", "dynamodb:GetItem"] + resources = ["*"] + } + statement { + sid = "S3Head" + effect = "Allow" + actions = ["s3:GetObject"] # HeadObject requests are authorized by s3:GetObject; IAM has no s3:HeadObject action + resources = ["*"] + } + statement { + sid = "SecretsRead" + effect = "Allow" + actions = ["secretsmanager:GetSecretValue"] + resources = ["*"] + } +} + +resource "aws_iam_role" "state_machine" { + name = "${local.state_machine_name}-role" + assume_role_policy = data.aws_iam_policy_document.sfn_assume.json + tags = var.tags +} +
+data "aws_iam_policy_document" "sfn_assume" { + statement { + effect = "Allow" + principals { + type = "Service" + identifiers = ["states.amazonaws.com"] + } + actions = ["sts:AssumeRole"] + } +} + +resource "aws_iam_role_policy" "state_machine_policy" { + name = "${local.state_machine_name}-policy" + role = aws_iam_role.state_machine.id + policy = data.aws_iam_policy_document.state_machine_policy.json +} + +data "aws_iam_policy_document" "state_machine_policy" { + statement { + sid = "InvokeValidator" + effect = "Allow" + actions = ["lambda:InvokeFunction"] + resources = [aws_lambda_function.validator.arn] + } + statement { + sid = "BackupCalls" + effect = "Allow" + actions = ["backup:DescribeRestoreJob", "backup:PutRestoreValidationResult"] + resources = ["*"] + } + statement { + sid = "Logs" + effect = "Allow" + actions = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"] + resources = ["arn:aws:logs:*:*:*" ] + } +} diff --git a/modules/aws-backup-validation/lambda.py b/modules/aws-backup-validation/lambda.py new file mode 100644 index 0000000..aa4fe9e --- /dev/null +++ b/modules/aws-backup-validation/lambda.py @@ -0,0 +1,75 @@ +import json +import os +import boto3 +import hashlib + +ssm = boto3.client('ssm') +backup = boto3.client('backup') +secrets = boto3.client('secretsmanager') +rds_data = boto3.client('rds-data') +dynamodb = boto3.client('dynamodb') +s3 = boto3.client('s3') + +CONFIG_PARAM_NAME = os.environ.get('CONFIG_PARAM_NAME') + +_cached_config = None + +def load_config(): + global _cached_config + if _cached_config is not None: + return _cached_config + if not CONFIG_PARAM_NAME: + _cached_config = {} + return _cached_config + resp = ssm.get_parameter(Name=CONFIG_PARAM_NAME) + _cached_config = json.loads(resp['Parameter']['Value']) + return _cached_config + +def handler(event, context): + # Event expected from Step Functions state machine + restore_job_id = event.get('detail', {}).get('restoreJobId') or event.get('restoreJobId') +
resource_type = event.get('detail', {}).get('resourceType') or event.get('resourceType') + created_arn = event.get('detail', {}).get('createdResourceArn') or event.get('createdResourceArn') + + config = load_config() + result = {"status": "SKIPPED", "message": f"No validator for {resource_type}"} + + try: + if resource_type in ("RDS", "Aurora"): + result = validate_rds_like(resource_type, created_arn, config.get('rds') or config.get('aurora')) + elif resource_type == "DynamoDB": + result = validate_dynamodb(created_arn, config.get('dynamodb')) + elif resource_type == "S3": + result = validate_s3(created_arn, config.get('s3')) + except Exception as exc: # noqa + result = {"status": "FAILED", "message": f"Unhandled validator error: {exc}"} + + return result + +def validate_rds_like(resource_type, arn, cfg): + if not cfg or not cfg.get('sql_checks'): + return {"status": "SKIPPED", "message": "No sql_checks configured"} + failures = [] + for chk in cfg['sql_checks']: + stmt = chk['statement'] + db = chk['database'] + # Placeholder: In real implementation we would look up secret and cluster endpoint + try: + # rds-data call would require secretArn + resourceArn for serverless Aurora or HTTP endpoint; omitted here + pass + except Exception as exc: # noqa + failures.append(f"{db}: {exc}") + if failures: + return {"status": "FAILED", "message": "; ".join(failures)[:1000]} + return {"status": "SUCCESSFUL", "message": "All RDS/Aurora checks passed (placeholder)"} + +def validate_dynamodb(arn, cfg): + if not cfg or not cfg.get('tables'): + return {"status": "SKIPPED", "message": "No dynamodb tables configured"} + # Placeholder logic only + return {"status": "SUCCESSFUL", "message": "DynamoDB validation placeholder"} + +def validate_s3(arn, cfg): + if not cfg or not cfg.get('buckets'): + return {"status": "SKIPPED", "message": "No s3 buckets configured"} + return {"status": "SUCCESSFUL", "message": "S3 validation placeholder"} diff --git 
a/modules/aws-backup-validation/lambda.tf b/modules/aws-backup-validation/lambda.tf new file mode 100644 index 0000000..b76410f --- /dev/null +++ b/modules/aws-backup-validation/lambda.tf @@ -0,0 +1,34 @@ +data "archive_file" "validator_zip" { + type = "zip" + source_file = "${path.module}/dist/index.js" + output_path = "${path.module}/lambda.zip" +} + +resource "aws_ssm_parameter" "config" { + name = local.ssm_param_name + type = "String" + value = var.validation_config_json + tags = var.tags +} + +resource "aws_lambda_function" "validator" { + function_name = local.validator_lambda_name + role = aws_iam_role.validator_lambda.arn + runtime = var.lambda_runtime + handler = "index.handler" + filename = data.archive_file.validator_zip.output_path + source_code_hash = data.archive_file.validator_zip.output_base64sha256 + timeout = var.lambda_timeout + environment { + variables = { + CONFIG_PARAM_NAME = aws_ssm_parameter.config.name + } + } + tags = var.tags +} + +resource "aws_cloudwatch_log_group" "validator" { + name = "/aws/lambda/${aws_lambda_function.validator.function_name}" + retention_in_days = var.log_retention_days + tags = var.tags +} diff --git a/modules/aws-backup-validation/outputs.tf b/modules/aws-backup-validation/outputs.tf new file mode 100644 index 0000000..ed5b501 --- /dev/null +++ b/modules/aws-backup-validation/outputs.tf @@ -0,0 +1,14 @@ +output "state_machine_arn" { + description = "ARN of the validation Step Functions state machine" + value = aws_sfn_state_machine.validation.arn +} + +output "validator_lambda_arn" { + description = "ARN of the validator lambda" + value = aws_lambda_function.validator.arn +} + +output "config_parameter_name" { + description = "Name of SSM parameter storing validation config" + value = aws_ssm_parameter.config.name +} diff --git a/modules/aws-backup-validation/package.json b/modules/aws-backup-validation/package.json new file mode 100644 index 0000000..d6ac2d7 --- /dev/null +++ 
b/modules/aws-backup-validation/package.json @@ -0,0 +1,19 @@ +{ + "name": "aws-backup-validation-lambda", + "version": "0.1.0", + "private": true, + "description": "Validator Lambda for AWS Backup restore testing (scaffold)", + "license": "UNLICENSED", + "type": "module", + "scripts": { + "build": "tsc", + "clean": "rimraf dist || rm -rf dist", + "package": "npm run build" + }, + "devDependencies": { + "@types/aws-lambda": "^8.10.129", + "@types/node": "^20.11.30", + "typescript": "^5.4.0", + "rimraf": "^5.0.5" + } +} diff --git a/modules/aws-backup-validation/src/index.ts b/modules/aws-backup-validation/src/index.ts new file mode 100644 index 0000000..359713d --- /dev/null +++ b/modules/aws-backup-validation/src/index.ts @@ -0,0 +1,71 @@ +import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm'; +import { BackupClient } from '@aws-sdk/client-backup'; +import { SecretsManagerClient } from '@aws-sdk/client-secrets-manager'; +import { RDSDataClient } from '@aws-sdk/client-rds-data'; +import { DynamoDBClient } from '@aws-sdk/client-dynamodb'; +import { S3Client } from '@aws-sdk/client-s3'; +import type { Context } from 'aws-lambda'; + +const ssm = new SSMClient({}); +const backup = new BackupClient({}); // reserved for future use +const secrets = new SecretsManagerClient({}); // future +const rdsData = new RDSDataClient({}); // future +const dynamodb = new DynamoDBClient({}); // future more detailed calls +const s3 = new S3Client({}); // future + +const CONFIG_PARAM_NAME = process.env.CONFIG_PARAM_NAME; +let cachedConfig: any | null = null; + +async function loadConfig(): Promise<any> { + if (cachedConfig) return cachedConfig; + if (!CONFIG_PARAM_NAME) { + cachedConfig = {}; + return cachedConfig; + } + const resp = await ssm.send(new GetParameterCommand({ Name: CONFIG_PARAM_NAME })); + cachedConfig = resp.Parameter?.Value ?
JSON.parse(resp.Parameter.Value) : {}; + return cachedConfig; +} + +interface ValidationResult { status: 'SUCCESSFUL' | 'FAILED' | 'SKIPPED'; message: string; } + +export const handler = async (event: any, _context: Context): Promise<ValidationResult> => { + const restoreJobId = event?.detail?.restoreJobId || event?.restoreJobId; + const resourceType = event?.detail?.resourceType || event?.resourceType; + const createdArn = event?.detail?.createdResourceArn || event?.createdResourceArn; + + const config = await loadConfig(); + let result: ValidationResult = { status: 'SKIPPED', message: `No validator for ${resourceType}` }; + + try { + if (resourceType === 'RDS' || resourceType === 'Aurora') { + result = await validateRdsLike(resourceType, createdArn, config.rds || config.aurora); + } else if (resourceType === 'DynamoDB') { + result = await validateDynamoDb(createdArn, config.dynamodb); + } else if (resourceType === 'S3') { + result = await validateS3(createdArn, config.s3); + } + } catch (err: any) { + result = { status: 'FAILED', message: `Unhandled validator error: ${err?.message || String(err)}` }; + } + + return result; +}; + +async function validateRdsLike(resourceType: string, arn: string, cfg: any): Promise<ValidationResult> { + if (!cfg || !cfg.sql_checks) { + return { status: 'SKIPPED', message: 'No sql_checks configured' }; + } + // Placeholder: iterate over cfg.sql_checks and (in future) execute statements via rds-data.
+ return { status: 'SUCCESSFUL', message: 'All RDS/Aurora checks passed (placeholder)' }; +} + +async function validateDynamoDb(arn: string, cfg: any): Promise { + if (!cfg || !cfg.tables) return { status: 'SKIPPED', message: 'No dynamodb tables configured' }; + return { status: 'SUCCESSFUL', message: 'DynamoDB validation placeholder' }; +} + +async function validateS3(arn: string, cfg: any): Promise { + if (!cfg || !cfg.buckets) return { status: 'SKIPPED', message: 'No s3 buckets configured' }; + return { status: 'SUCCESSFUL', message: 'S3 validation placeholder' }; +} diff --git a/modules/aws-backup-validation/statemachine.json.tpl b/modules/aws-backup-validation/statemachine.json.tpl new file mode 100644 index 0000000..c1b7949 --- /dev/null +++ b/modules/aws-backup-validation/statemachine.json.tpl @@ -0,0 +1,49 @@ +{ + "Comment": "Restore Test Validation Orchestrator", + "StartAt": "EnrichRestoreJob", + "States": { + "EnrichRestoreJob": { + "Type": "Task", + "Resource": "arn:aws:states:::aws-sdk:backup:describeRestoreJob", + "Parameters": {"RestoreJobId.$": "$.detail.restoreJobId"}, + "ResultPath": "$.restoreJob", + "Next": "Route" + }, + "Route": { + "Type": "Choice", + "Choices": [ + {"Variable": "$.detail.resourceType", "StringEquals": "Aurora", "Next": "InvokeValidator"}, + {"Variable": "$.detail.resourceType", "StringEquals": "RDS", "Next": "InvokeValidator"}, + {"Variable": "$.detail.resourceType", "StringEquals": "DynamoDB", "Next": "InvokeValidator"}, + {"Variable": "$.detail.resourceType", "StringEquals": "S3", "Next": "InvokeValidator"} + ], + "Default": "Skip" + }, + "InvokeValidator": { + "Type": "Task", + "Resource": "arn:aws:states:::lambda:invoke", + "OutputPath": "$.Payload", + "Parameters": { + "FunctionName": "${lambda_arn}", + "Payload.$": "$" + }, + "Next": "PublishResult" + }, + "Skip": { + "Type": "Pass", + "Result": {"status": "SKIPPED", "message": "No validator implemented"}, + "ResultPath": "$.validation", + "Next": "PublishResult" + }, 
+    "PublishResult": {
+      "Type": "Task",
+      "Resource": "arn:aws:states:::aws-sdk:backup:putRestoreValidationResult",
+      "Parameters": {
+        "RestoreJobId.$": "$$.Execution.Input.detail.restoreJobId",
+        "ValidationStatus.$": "$.status",
+        "ValidationStatusMessage.$": "$.message"
+      },
+      "End": true
+    }
+  }
+}
diff --git a/modules/aws-backup-validation/statemachine.tf b/modules/aws-backup-validation/statemachine.tf
new file mode 100644
index 0000000..16b4188
--- /dev/null
+++ b/modules/aws-backup-validation/statemachine.tf
@@ -0,0 +1,79 @@
+locals {
+  statemachine_definition = templatefile("${path.module}/statemachine.json.tpl", {
+    lambda_arn = aws_lambda_function.validator.arn
+  })
+}
+
+resource "aws_sfn_state_machine" "validation" {
+  name       = local.state_machine_name
+  role_arn   = aws_iam_role.state_machine.arn
+  definition = local.statemachine_definition
+  tags       = var.tags
+}
+
+resource "aws_iam_role_policy" "allow_sfn_logs" {
+  name = "${local.state_machine_name}-logs"
+  role = aws_iam_role.state_machine.id
+  policy = jsonencode({
+    Version = "2012-10-17"
+    Statement = [{
+      Effect   = "Allow"
+      Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
+      Resource = "arn:aws:logs:*:*:*"
+    }]
+  })
+}
+
+# EventBridge Rule for restore job completion
+resource "aws_cloudwatch_event_rule" "restore_completed" {
+  name        = "${var.name_prefix}-restore-completed"
+  description = "Triggers validation on restore job completion"
+  event_pattern = jsonencode({
+    source        = ["aws.backup"],
+    "detail-type" = ["Restore Job State Change"],
+    detail = {
+      status                = ["COMPLETED"]
+      restoreTestingPlanArn = [var.restore_testing_plan_arn]
+    }
+  })
+  tags = var.tags
+}
+
+resource "aws_cloudwatch_event_target" "sfn_target" {
+  rule      = aws_cloudwatch_event_rule.restore_completed.name
+  target_id = "${var.name_prefix}-sfn"
+  arn       = aws_sfn_state_machine.validation.arn
+  role_arn  = aws_iam_role.eventbridge_invoke.arn
+}
+
+resource "aws_iam_role" "eventbridge_invoke" {
+  name               = "${var.name_prefix}-events-invoke-sfn-role"
+  assume_role_policy = data.aws_iam_policy_document.events_assume.json
+  tags               = var.tags
+}
+
+data "aws_iam_policy_document" "events_assume" {
+  statement {
+    effect = "Allow"
+    principals {
+      type        = "Service"
+      identifiers = ["events.amazonaws.com"]
+    }
+    actions = ["sts:AssumeRole"]
+  }
+}
+
+resource "aws_iam_role_policy" "events_invoke_sfn" {
+  name = "${var.name_prefix}-events-invoke-sfn"
+  role = aws_iam_role.eventbridge_invoke.id
+  policy = jsonencode({
+    Version = "2012-10-17",
+    Statement = [
+      {
+        Effect   = "Allow",
+        Action   = ["states:StartExecution"],
+        Resource = aws_sfn_state_machine.validation.arn
+      }
+    ]
+  })
+}
diff --git a/modules/aws-backup-validation/tsconfig.json b/modules/aws-backup-validation/tsconfig.json
new file mode 100644
index 0000000..ea190c3
--- /dev/null
+++ b/modules/aws-backup-validation/tsconfig.json
@@ -0,0 +1,14 @@
+{
+  "compilerOptions": {
+    "target": "ES2020",
+    "module": "ES2020",
+    "moduleResolution": "Node",
+    "esModuleInterop": true,
+    "forceConsistentCasingInFileNames": true,
+    "strict": true,
+    "skipLibCheck": true,
+    "outDir": "dist"
+  },
+  "include": ["src/**/*.ts"],
+  "exclude": ["node_modules"]
+}
diff --git a/modules/aws-backup-validation/variables.tf b/modules/aws-backup-validation/variables.tf
new file mode 100644
index 0000000..6319bb2
--- /dev/null
+++ b/modules/aws-backup-validation/variables.tf
@@ -0,0 +1,52 @@
+variable "enable" {
+  description = "Whether to deploy restore validation components."
+  type        = bool
+  default     = true
+}
+
+variable "name_prefix" {
+  description = "Prefix for created resources (state machine, lambda, etc)."
+  type        = string
+  default     = "backup-restore-validation"
+}
+
+variable "restore_testing_plan_arn" {
+  description = "ARN of the AWS Backup Restore Testing Plan to filter events."
+  type        = string
+}
+
+variable "resource_types" {
+  description = "List of resource types we will attempt to validate (e.g. [\"RDS\", \"Aurora\", \"DynamoDB\", \"S3\"])."
+  type        = list(string)
+  default     = []
+}
+
+variable "validation_config_json" {
+  description = "Raw JSON string of validation configuration to be stored in SSM Parameter for the Lambda validator."
+  type        = string
+  default     = "{}"
+}
+
+variable "lambda_runtime" {
+  description = "Runtime for validator lambda."
+  type        = string
+  default     = "nodejs20.x"
+}
+
+variable "lambda_timeout" {
+  description = "Timeout in seconds for validator lambda."
+  type        = number
+  default     = 60
+}
+
+variable "log_retention_days" {
+  description = "CloudWatch log retention for validator lambda."
+  type        = number
+  default     = 14
+}
+
+variable "tags" {
+  description = "Tags to apply to created resources."
+  type        = map(string)
+  default     = {}
+}
diff --git a/modules/aws-backup-validation/versions.tf b/modules/aws-backup-validation/versions.tf
new file mode 100644
index 0000000..7f163ea
--- /dev/null
+++ b/modules/aws-backup-validation/versions.tf
@@ -0,0 +1,9 @@
+terraform {
+  required_version = ">= 1.5.0"
+  required_providers {
+    aws = {
+      source  = "hashicorp/aws"
+      version = ">= 5.0"
+    }
+  }
+}