ENG-893 Implement AWS Backup Restore Validation Module #76

Open
terado wants to merge 2 commits into main from NIMI9-ENG-893-restore-validation-design

@terado (Contributor) commented Sep 19, 2025

Description

This PR introduces the initial implementation of automated post-restore validation capabilities for AWS Backup restore testing within the blueprint. It delivers:

  • A new Terraform module: modules/aws-backup-validation that provisions:
    • EventBridge rule filtering Restore Job State Change events (status = COMPLETED) for a configured restore testing plan.
    • Step Functions Standard state machine orchestrating validation steps (enrich → route → invoke validator/skip → publish result).
    • Validator Lambda (extensible Python placeholder) loading config from SSM Parameter Store.
    • SSM Parameter to hold user-supplied validation configuration (raw JSON for now).
    • IAM roles/policies (currently permissive placeholders pending least-privilege tightening).
    • CloudWatch Log Group for Lambda and state machine logging.
  • Design documentation updates in docs/restore-testing-design.md referencing the new flow and module.
  • A PlantUML sequence diagram: docs/diagrams/restore-validation-sequence.puml depicting the validation workflow.
  • Documentation alignment and markdown lint fixes across the design artefact.
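
For illustration, the rule's filter could look roughly like the pattern below, written as a Python dict with a small matcher so it can be exercised locally. This is a sketch rather than the module's actual rule: the `restoreTestingPlanArn` detail field, the placeholder ARN, and the `matches` helper are assumptions, and the matcher handles only exact-value lists, not the full EventBridge pattern syntax.

```python
# Hypothetical sketch of the EventBridge filter; not the module's actual rule.
# Placeholder ARN for illustration only.
PLAN_ARN = "arn:aws:backup:eu-west-2:111122223333:restore-testing-plan:example-plan"

EVENT_PATTERN = {
    "source": ["aws.backup"],
    "detail-type": ["Restore Job State Change"],
    "detail": {
        "status": ["COMPLETED"],
        "restoreTestingPlanArn": [PLAN_ARN],  # assumed field name
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: dicts recurse, lists mean 'any of these exact values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Illustrative event; only the fields the pattern inspects are included.
sample_event = {
    "source": "aws.backup",
    "detail-type": "Restore Job State Change",
    "detail": {
        "status": "COMPLETED",
        "restoreTestingPlanArn": PLAN_ARN,
        "resourceType": "S3",
    },
}
```

A `FAILED` restore job, or a job from a different plan, would fall through the filter and never start the state machine.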

This forms the foundation for blueprint consumers to define integrity checks (e.g. RDS SQL assertions, DynamoDB item sampling, S3 object/head checks) executed automatically after restore test completion, reporting status back via PutRestoreValidationResult.
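
A rough sketch of that final reporting step, assuming boto3's Backup client and its `put_restore_validation_result` call. The helper names `build_validation_result` and `publish_result` are hypothetical, and the client is passed in so the logic can be exercised without AWS access.

```python
from typing import Any, Dict, Optional

# Status values accepted by PutRestoreValidationResult, as I understand the API.
VALID_STATUSES = {"SUCCESSFUL", "FAILED", "TIMED_OUT", "VALIDATING"}


def build_validation_result(
    restore_job_id: str, status: str, message: Optional[str] = None
) -> Dict[str, Any]:
    """Build the request parameters, rejecting statuses the API would not accept."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unsupported validation status: {status!r}")
    params: Dict[str, Any] = {
        "RestoreJobId": restore_job_id,
        "ValidationStatus": status,
    }
    if message:
        params["ValidationStatusMessage"] = message
    return params


def publish_result(
    backup_client: Any, restore_job_id: str, status: str, message: Optional[str] = None
) -> None:
    """Report the outcome against the restore job using the supplied client."""
    backup_client.put_restore_validation_result(
        **build_validation_result(restore_job_id, status, message)
    )


# In a Lambda this might be called as (requires boto3 and AWS credentials):
#   publish_result(boto3.client("backup"), job_id, "SUCCESSFUL", "all checks passed")
```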

Context

Restore testing previously verified only that a resource could be restored; it did not validate functional integrity. The added module and design formalise an event-driven validation pipeline so implementers can define resource-type specific checks declaratively. This aligns with the stated requirement to “test integrity of restored resources” and enable customer-defined SQL or data-level assertions. The implementation is intentionally scaffold-grade: operational hooks and IAM scoping will be refined later, but the architecture and invocation path are now runnable and extensible.
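
One possible shape for the resource-type dispatch inside the validator Lambda is sketched below. It is not the code in this PR: the `register` decorator, the `run_validation` function, the result dict layout, and the internal `SKIPPED` outcome are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict

# A validator receives the restore job detail and its own config section,
# and returns a dict with "status" and "message" keys (hypothetical contract).
Validator = Callable[[dict, dict], dict]
VALIDATORS: Dict[str, Validator] = {}


def register(resource_type: str) -> Callable[[Validator], Validator]:
    """Decorator that registers a validator for one AWS Backup resource type."""
    def decorator(fn: Validator) -> Validator:
        VALIDATORS[resource_type] = fn
        return fn
    return decorator


@register("S3")
def validate_s3(restore_job: dict, checks: dict) -> dict:
    # Placeholder outcome; a real check would inspect the restored bucket.
    return {"status": "SUCCESSFUL", "message": "S3 placeholder check passed"}


def run_validation(restore_job: dict, config: dict) -> dict:
    """Route to the registered validator, or skip when none is configured."""
    resource_type = restore_job.get("resourceType", "")
    validator = VALIDATORS.get(resource_type)
    if validator is None or resource_type not in config:
        return {
            "status": "SKIPPED",
            "message": f"no validator configured for {resource_type!r}",
        }
    return validator(restore_job, config[resource_type])
```

Adding a new resource type then means registering one more function, leaving the routing untouched.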

Key motivations:

  • Close the gap between “restorable” and “usable” backups.
  • Provide a pluggable validation pattern without forcing custom per-team orchestration code.
  • Establish a consistent interface for future metrics, alerting, and additional resource validators.

Type of changes

  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would change existing functionality)
  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I am familiar with the contributing guidelines
  • I have followed the code style of the project (Terraform & markdown conventions observed)
  • I have added tests to cover my changes (N/A: infrastructure scaffold; future validator logic will introduce test harness)
  • I have updated the documentation accordingly (restore-testing-design.md, new diagram, module README)
  • This PR is a result of pair or mob programming

Sensitive Information Declaration

To ensure the utmost confidentiality and protect privacy, no PII/PID or other sensitive data has been added. All examples are generic (non-production identifiers, no secrets).

  • I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

Additional Notes

  • IAM policies are deliberately broad for the prototype; follow-up PR will scope actions and resources per enabled validator.
  • Validator Lambda currently returns placeholder success/skip outcomes; real integrity logic (RDS rds-data queries, DynamoDB keyed checks, S3 manifest comparisons) will be added incrementally.
  • Module interface presently accepts a raw JSON config string (validation_config_json); future enhancement may promote a typed Terraform object schema.
  • Sequence diagram can be rendered via any PlantUML-compatible plugin from docs/diagrams/restore-validation-sequence.puml.
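
Until a typed schema exists, the Lambda has to defend itself against malformed `validation_config_json`. A minimal sketch of that parsing step follows; the `load_validation_config` helper and the keys in `EXAMPLE_CONFIG` are assumptions for illustration and do not reflect a schema defined by this PR.

```python
import json


def load_validation_config(raw: str) -> dict:
    """Parse the raw JSON string and check it is an object of per-resource-type objects."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"validation_config_json is not valid JSON: {exc}") from exc
    if not isinstance(config, dict):
        raise ValueError("validation config must be a JSON object keyed by resource type")
    for resource_type, section in config.items():
        if not isinstance(section, dict):
            raise ValueError(f"config for {resource_type!r} must be a JSON object")
    return config


# Hypothetical config; the inner keys are placeholders, not a defined schema.
EXAMPLE_CONFIG = '{"S3": {"required_keys": ["manifest.json"]}, "DynamoDB": {"sample_size": 10}}'
config = load_validation_config(EXAMPLE_CONFIG)
```

In the deployed module the raw string would come from the SSM parameter rather than a literal.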

…d Step Functions

ENG-893 Add restore validation sequence diagram and update documentation
@terado requested a review from a team as a code owner on September 19, 2025 at 13:13
@terado requested a review from mannickutd on September 19, 2025 at 13:14
![end-to-end visual of the event-driven validation workflow](diagrams/restore-validation-sequence.png)

```text
AWS Backup Restore Testing Plan (scheduled)
│ (runs restore jobs)
Restore Test Jobs (Test restore of latest/random recovery points)
```

Contributor review comment on "AWS Backup Restore Testing Plan (scheduled)": Not sure we want to enforce scheduled

Contributor review comment on "Restore Test Jobs (Test restore of latest/random recovery points)": specified restore points


### Why Step Functions?

- Orchestrates retries, parallel fan-out per restored resource

Contributor review comment on "Orchestrates retries, parallel fan-out per restored resource": Consolidates the distributed architecture into an execution platform detailing the lifecycle of the event.

@regularfry (Contributor) commented:

This is good as far as it goes, but it doesn't obviously support cross-resource validation. If the validation question I want to answer is "having restored both, does every S3 path listed in this DynamoDB table actually exist in that bucket, and vice versa?", is that an S3 validation or a DynamoDB validation? How would we support that in this framework?
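
To make the check concrete, here is a hedged sketch of the comparison itself, assuming the paths have already been read from the restored table and the keys listed from the restored bucket. The `cross_check` name and the result keys are hypothetical, and the sketch does not address where such a check would live in the framework.

```python
from typing import Dict, Iterable, List


def cross_check(table_paths: Iterable[str], bucket_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Two-way comparison: paths the table lists but the bucket lacks, and vice versa."""
    listed = set(table_paths)
    present = set(bucket_keys)
    return {
        "missing_in_bucket": sorted(listed - present),
        "missing_in_table": sorted(present - listed),
    }


result = cross_check(
    table_paths=["a/1.json", "a/2.json", "b/3.json"],
    bucket_keys=["a/1.json", "b/3.json", "c/4.json"],
)
# The restore pair is consistent only when both lists are empty.
consistent = not result["missing_in_bucket"] and not result["missing_in_table"]
```

The check needs both restored resources at once, so it does not map onto a single restore job event.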
