feat: Phase 4 PR-A eval-summary backbone#21

Merged: AD2000X merged 1 commit into main from feature/phase4-demo-eval-report on Jun 3, 2026

@AD2000X (Owner) commented Jun 3, 2026

  • src/phase4_summary.py: pure per-phase summarizers + inline layout-CSV aggregation + deterministic markdown render (no file/Drive/gradio IO)
  • scripts/build_phase4_summary.py: 5 metrics JSONs + 3 layout CSVs ->
    outputs/evaluation/phase4_summary.json (gitignored) + reports/phase4_metrics.md
  • tests/test_phase4_summary.py: 10 synthetic tests (full pytest 246 green)
  • reports/phase4_metrics.md: generated metrics snapshot (no-drift, LF)
  • docs/phase4_brief.md: committed implementation brief
  • README/DEVLOG/PLAN: Phase 4 in progress; README stale "Phase 2 active" removed

reports/phase4_metrics.md is generated; outputs/evaluation/phase4_summary.json remains gitignored.

@AD2000X AD2000X merged commit 9fbc67f into main Jun 3, 2026
1 check passed

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d15fadc18

Comment thread src/phase4_summary.py

def summarize_qa(d: dict) -> dict:
    configs = {name: {k: m[k] for k in _QA_KEYS} for name, m in d["configs"].items()}
    return {"n": d["num_questions"], "configs": configs}

P2: Report QA metrics with the answerable denominator

When phase1c_qa.json includes the current manual/unanswerable seed, aggregate_qa computes answer_exact, numeric_relaxed, citation_hit, and abstain_rate only over answerable questions while the top-level num_questions still includes the 6 unanswerable rows. By collapsing the summary to n = d["num_questions"] and dropping each config's num_answerable, the generated report labels these answerable-only rates as n=46 instead of n=40, so the Phase 4 metric table misstates the denominator. Please carry num_answerable (or both counts) through the summary/render path.


@AD2000X AD2000X deleted the feature/phase4-demo-eval-report branch June 3, 2026 16:44