feat: Phase 4 PR-A eval-summary backbone #21
- src/phase4_summary.py: pure per-phase summarizers + inline layout-CSV aggregation + deterministic markdown render (no file/Drive/gradio IO)
- scripts/build_phase4_summary.py: 5 metrics JSONs + 3 layout CSVs -> outputs/evaluation/phase4_summary.json (gitignored) + reports/phase4_metrics.md
- tests/test_phase4_summary.py: 10 synthetic tests (full pytest 246 green)
- reports/phase4_metrics.md: generated metrics snapshot (no-drift, LF)
- docs/phase4_brief.md: committed implementation brief
- README/DEVLOG/PLAN: Phase 4 in progress; README stale "Phase 2 active" removed

reports/phase4_metrics.md is generated; outputs/evaluation/phase4_summary.json remains gitignored.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6d15fadc18
    def summarize_qa(d: dict) -> dict:
        configs = {name: {k: m[k] for k in _QA_KEYS} for name, m in d["configs"].items()}
        return {"n": d["num_questions"], "configs": configs}
Report QA metrics with the answerable denominator
When phase1c_qa.json includes the current manual/unanswerable seed, aggregate_qa computes answer_exact, numeric_relaxed, citation_hit, and abstain_rate only over answerable questions while the top-level num_questions still includes the 6 unanswerable rows. By collapsing the summary to n = d["num_questions"] and dropping each config's num_answerable, the generated report labels these answerable-only rates as n=46 instead of n=40, so the Phase 4 metric table misstates the denominator. Please carry num_answerable (or both counts) through the summary/render path.