docs: add agentic CI plan for automated PR reviews and daily maintenance #473

Draft
andreatgretel wants to merge 7 commits into main from andreatgretel/feat/agentic-ci

Conversation

@andreatgretel
Contributor

@andreatgretel andreatgretel commented Mar 30, 2026

📋 Summary

Plan for adding an agentic CI layer to the repo: GitHub Actions workflows that run Claude Code or Codex on a self-hosted runner to review PRs and run daily tech debt maintenance. Closes #472.

🔄 Changes

✨ Added

  • plans/472/agentic-ci-plan.md - full plan covering:
    • Recipe format and directory layout (.agents/recipes/)
    • PR review recipe composing the existing review-code skill
    • Five daily maintenance suites rotating Mon-Fri: docs-and-references, dependencies, structure, code-quality, test-health
    • Runner memory for dedup and delta tracking across runs
    • Security model: collaborator-only triggers, minimal permissions, prompt injection mitigations
    • Phased rollout (4 phases)

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • plans/472/agentic-ci-plan.md - This is a plan-only PR (no implementation). Review for feasibility, security concerns, and whether the suite coverage and phasing make sense for the team.

🤖 Generated with AI

eric-tramel
eric-tramel previously approved these changes Mar 30, 2026
Contributor

@eric-tramel eric-tramel left a comment

Excellent initiative, let's do it 🚀


Constraints:
- Only runs on non-draft PRs
- Skips if the PR only touches docs/markdown (configurable per recipe)
Contributor

We should probably have agent workflow reviews in this case, too (e.g. keeping docs in sync, making sure the edits are faithful to the codebase).

Comment on lines +182 to +186
Running all suites every day is technically possible (stagger them hourly, e.g.,
06:00-10:00 UTC) and would surface issues faster. But it creates up to five
PR/issue streams per day, which risks becoming noise the team tunes out - the
opposite of the goal. One suite per day keeps the output digestible and gives
each finding proper attention.
Contributor

Once we have trust in the system, especially crossing the Rubicon on automated merging of simple edits (e.g. fix a doc, etc), then increasing the rate of review will be valuable.


#### Wednesday / structure

Enforces the multi-package layering that makes DataDesigner work.
Contributor

Could also include the import / CLI boot-time checks that @nabinchha added previously.

Contributor

We have a test for it already, so it should be covered...?

| Wed | structure | import boundaries, circular deps, dead exports |
| Thu | code-quality | complexity, exception hygiene, type gaps, TODO aging |
| Fri | test-health | coverage deltas, hollow tests, fixtures, smoke tests |
| Sat/Sun | off | - |
Contributor

We should reserve Sat/Sun for longer performance benchmarking and AI-QA tests. This isn't fully built out yet, but that will be the natural thing to add to the automation once it is (e.g. measuring mocked execution times, memory overhead, hotspot detection).

Additionally, an AI-QA test would be: let the agent try to construct SDG workflows and execute them, recording friction and problems along the way.
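
For illustration, the weekday rotation with an optional weekend slot could be sketched as a small lookup. This is a sketch under assumptions, not the plan's implementation: the function name `suite_for` and the `weekend-benchmarks` placeholder are hypothetical, and the Mon/Tue assignment is inferred from the order in which the suites are listed in the PR summary (only Wed-Fri appear in the quoted table).

```python
from datetime import date
from typing import Optional

# Weekday index (Monday == 0) -> suite name.
# Wed-Fri come from the quoted table; Mon/Tue are assumed from the
# suite order in the PR summary.
ROTATION = {
    0: "docs-and-references",
    1: "dependencies",
    2: "structure",
    3: "code-quality",
    4: "test-health",
}

# Hypothetical placeholder for the weekend work proposed in the comment.
WEEKEND_SUITE = "weekend-benchmarks"


def suite_for(day: date, use_weekends: bool = False) -> Optional[str]:
    """Return the suite scheduled for `day`, or None if nothing runs."""
    weekday = day.weekday()
    if weekday in ROTATION:
        return ROTATION[weekday]
    return WEEKEND_SUITE if use_weekends else None
```

Keeping the weekend slot behind a flag means the current "off" behavior stays the default until the benchmarking and AI-QA pieces exist.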

@nabinchha
Contributor

Nice work on this one, @andreatgretel — this is a thorough and well-structured plan. Here are my thoughts.

Summary

This PR adds a comprehensive plan for introducing agentic CI to DataDesigner: GitHub Actions workflows that run Claude Code or Codex on a self-hosted runner to perform automated PR reviews and rotating daily maintenance audits. The plan covers architecture (recipe format, directory layout), security (prompt injection, minimal permissions), phased rollout, and runner memory for cross-run dedup. The plan matches the stated intent in the PR description and closes #472.

Findings

Warnings — Worth addressing

plans/472/agentic-ci-plan.md:189 — PR review skipping docs/markdown PRs may miss important changes

  • What: The PR review workflow constraint says "Skips if the PR only touches docs/markdown (configurable per recipe)." For a project that treats documentation as a first-class artifact (architecture docs, AGENTS.md, STYLEGUIDE.md, skills), skipping agent review on docs-only PRs could miss broken cross-references, stale guidance, or inconsistencies with code.
  • Why: Eric's inline comment on this line raises a related point — agent reviews should also verify docs stay in sync with the codebase. Skipping docs PRs entirely works against that goal.
  • Suggestion: Default to reviewing docs PRs as well, but with a lighter recipe variant (skip linting, focus on link validity and consistency with code). The skip behavior could be reserved for trivial changes like typo fixes, gated by a label (e.g., skip-agent-review) rather than by file type.
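
One way to read this suggestion is as a small decision function: drafts and explicitly labelled PRs are skipped, docs-only PRs get a lighter recipe, and everything else gets the full review. The sketch below is illustrative only. The label name `skip-agent-review` comes from the suggestion; the recipe names `pr-review` and `pr-review-docs` and the definition of a "docs" path are assumptions.

```python
from typing import Iterable, Optional

SKIP_LABEL = "skip-agent-review"          # label name from the suggestion
DOC_SUFFIXES = (".md", ".mdx", ".rst")    # assumed meaning of "docs/markdown"


def is_docs_path(path: str) -> bool:
    """Treat anything under docs/ or with a docs suffix as documentation."""
    return path.startswith("docs/") or path.lower().endswith(DOC_SUFFIXES)


def select_recipe(
    changed_files: Iterable[str],
    labels: Iterable[str],
    is_draft: bool,
) -> Optional[str]:
    """Return the recipe to run, or None when the review is skipped."""
    if is_draft or SKIP_LABEL in set(labels):
        return None
    files = list(changed_files)
    # Docs-only PRs still get reviewed, but with the lighter variant.
    if files and all(is_docs_path(f) for f in files):
        return "pr-review-docs"   # hypothetical recipe name
    return "pr-review"            # hypothetical recipe name
```

Gating the skip on a label rather than on file type keeps the opt-out explicit and visible in the PR itself.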

plans/472/agentic-ci-plan.md:365-369 — Memory committed to a branch creates merge friction

  • What: The plan proposes committing runner memory (.agents/memory/) to a long-lived agentic-ci/state branch that rebases on main before each run.
  • Why: Rebasing a branch with frequent automated commits against an active main branch is a common source of CI failures — merge conflicts in the JSON state file, failed rebases when main moves fast, and noisy commit history. The state branch also needs its own CI exemption (no lint, no tests on memory commits) or it'll trigger the full pipeline on every update.
  • Suggestion: Could we evaluate GitHub Actions cache or a simple artifact-based approach as the primary storage, with the committed branch as a fallback for auditability? The plan already mentions this as an alternative — it might be worth making it the default and keeping the branch approach as the transparent-audit option. Alternatively, a single runner-state.json file on main updated via a dedicated bot PR (squashed, auto-merged) would avoid the long-lived branch entirely.
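
Whichever storage backend is chosen, the dedup logic itself can stay independent of it. A minimal sketch, assuming the state is a single JSON file (the name `runner-state.json` is from the suggestion above) and that a finding can be identified by a `(suite, path, rule)` triple; the schema and function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(suite: str, path: str, rule: str) -> str:
    """Stable short ID for a finding, so reruns can recognise it."""
    raw = "\x1f".join((suite, path, rule))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def load_seen(state_file: Path) -> set:
    """Load previously reported fingerprints; empty on first run."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text())["seen"])


def record_new(state_file: Path, findings: list) -> list:
    """Return only findings not reported before, and persist them."""
    seen = load_seen(state_file)
    fresh = [f for f in findings if fingerprint(*f) not in seen]
    seen.update(fingerprint(*f) for f in fresh)
    # Sorted output keeps the file diff small and deterministic.
    state_file.write_text(json.dumps({"seen": sorted(seen)}, indent=2))
    return fresh
```

Because the function only needs a file path, the same code would work whether the file is restored from a cache, downloaded as an artifact, or checked out from a branch.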

Suggestions — Take it or leave it

plans/472/agentic-ci-plan.md:569-575 — Open questions could address cost controls

  • What: The open questions section covers flaky tests and dry-run mode, but doesn't mention cost/budget controls — a practical concern for daily agent runs against paid model APIs.
  • Why: Each suite run consumes tokens. Without a budget cap or cost tracking, a runaway recipe (e.g., one that reads the entire codebase on every run) could generate unexpected bills. The audit trail section mentions logging token usage, but there's no discussion of limits or alerts.
  • Suggestion: Add an open question about cost guardrails: per-run token budget, monthly spend alerts, or automatic recipe disabling if cost exceeds a threshold. This is especially relevant for Phase 2+ when five suites run weekly.
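
As a rough illustration of a per-run token budget, the guard below stops a run once accumulated usage would exceed a cap. It is a sketch under assumptions: the class and exception names are hypothetical, and how token counts are obtained from the agent is left out.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push usage past the per-run limit."""


@dataclass
class TokenBudget:
    limit: int      # maximum tokens allowed for this run
    used: int = 0   # tokens consumed so far

    def charge(self, tokens: int) -> int:
        """Record `tokens` of usage and return the remaining budget."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"run would use {self.used + tokens} tokens, limit is {self.limit}"
            )
        self.used += tokens
        return self.limit - self.used
```

A caller would catch `BudgetExceeded`, abort the recipe, and log the event so that repeated overruns can feed a spend alert or disable the recipe.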

plans/472/agentic-ci-plan.md:1-6 — Frontmatter could include a status field

  • What: The plan frontmatter has date and authors but no status indicator. Other plans in the repo (e.g., plans/427/) follow the same pattern.
  • Why: As the plan moves through phases, it would be useful to know at a glance whether it's "proposed", "accepted", "in-progress", or "completed" without reading the full document or checking the PR state.
  • Suggestion: Consider adding status: proposed to the frontmatter. This is a minor convention that could be adopted across all plans if the team finds it useful.

What Looks Good

  • Skill composition model is exactly right. The decision to have recipes invoke existing skills rather than duplicating review logic is the strongest design choice in the plan. It means the review-code skill remains the single source of truth, and improvements flow to both interactive and CI usage automatically. The clear separation — "recipes own when/how, skills own what" — is clean and maintainable.

  • Security section is unusually thorough for a plan document. The prompt injection surface analysis with per-input-type risk/mitigation, the explicit pull_request vs pull_request_target guidance, and the YOLO mode hardening constraints show real thought about the threat model. This will serve as a solid reference when implementing the workflows.

  • The rotation rationale is well-argued. Rather than just stating "one suite per day," the plan explains why (noise management, attention budget) and then lists four concrete alternatives with trade-offs for each. This makes it easy for the team to revisit the decision later with full context.

Verdict

Needs changes — The memory storage approach and the docs-skip behavior are worth resolving before merge. None of these require major restructuring — they're refinements to an already solid plan.

| Wed | structure | import boundaries, circular deps, dead exports |
| Thu | code-quality | complexity, exception hygiene, type gaps, TODO aging |
| Fri | test-health | coverage deltas, hollow tests, fixtures, smoke tests |
| Sat/Sun | off | - |
Contributor

Weekends seem like a good time for agents to be busy. Something to think about as we evolve this.

Keeps the dependency graph healthy and secure.

- **Version pinning audit**: compare pinned versions in all three `pyproject.toml`
files against latest available. Prefer strict pins (`==`) over loose (`>=`) for
Contributor

Strict pins are tricky, since we also want to balance them against UX/DX, though the recent litellm issue should give us pause.
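
To make the strict-versus-loose distinction concrete, here is a minimal sketch that flags dependency strings not pinned with an exact `==` version. It is a simplified regex check written for illustration, not a full PEP 440 parser; the function names and the example requirement strings are hypothetical.

```python
import re

# name, optional [extras], "==", an exact version (no wildcard, no extra
# clauses), and an optional environment marker after ";".
_STRICT = re.compile(
    r"^\s*[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^*,;\s]+\s*(;.*)?$"
)


def is_strict_pin(requirement: str) -> bool:
    """True if the requirement pins exactly one version with `==`."""
    return bool(_STRICT.match(requirement))


def loose_pins(dependencies: list) -> list:
    """Return the dependency strings that are not strictly pinned."""
    return [dep for dep in dependencies if not is_strict_pin(dep)]
```

An audit suite could report the output of `loose_pins` as candidates for discussion rather than as failures, which leaves room for the UX/DX trade-off raised here.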


---

### Suite Details
Contributor

The proposed suites cover code and docs, which is great. Perhaps we should also create a suite for repo maintenance like analyzing open issues and PRs and creating a report for us to review each week. Eventually, we'll add issue solving to the list too.

Development

Successfully merging this pull request may close these issues.

feat: agentic CI - automated PR reviews and scheduled maintenance

4 participants