feat: Phase 4 PR-A eval-summary backbone by AD2000X · Pull Request #21 · AD2000X/FinDocStructRAG

AD2000X · 2026-06-03T14:51:11Z

- src/phase4_summary.py: pure per-phase summarizers + inline layout-CSV aggregation + deterministic markdown render (no file/Drive/gradio IO)
- scripts/build_phase4_summary.py: 5 metrics JSONs + 3 layout CSVs -> outputs/evaluation/phase4_summary.json (gitignored) + reports/phase4_metrics.md
- tests/test_phase4_summary.py: 10 synthetic tests (full pytest 246 green)
- reports/phase4_metrics.md: generated metrics snapshot (no-drift, LF)
- docs/phase4_brief.md: committed implementation brief
- README/DEVLOG/PLAN: Phase 4 in progress; README stale "Phase 2 active" removed

reports/phase4_metrics.md is generated; outputs/evaluation/phase4_summary.json remains gitignored.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d15fadc18


chatgpt-codex-connector · 2026-06-03T14:54:19Z

+
+def summarize_qa(d: dict) -> dict:
+    configs = {name: {k: m[k] for k in _QA_KEYS} for name, m in d["configs"].items()}
+    return {"n": d["num_questions"], "configs": configs}


Report QA metrics with the answerable denominator

When phase1c_qa.json includes the current manual/unanswerable seed, aggregate_qa computes answer_exact, numeric_relaxed, citation_hit, and abstain_rate only over answerable questions while the top-level num_questions still includes the 6 unanswerable rows. By collapsing the summary to n = d["num_questions"] and dropping each config's num_answerable, the generated report labels these answerable-only rates as n=46 instead of n=40, so the Phase 4 metric table misstates the denominator. Please carry num_answerable (or both counts) through the summary/render path.


AD2000X merged commit 9fbc67f into main Jun 3, 2026
1 check passed

chatgpt-codex-connector Bot reviewed Jun 3, 2026


AD2000X deleted the feature/phase4-demo-eval-report branch June 3, 2026 16:44

AD2000X merged 1 commit into main from feature/phase4-demo-eval-report
