AD2000X · Jun 3, 2026
diff --git a/DEVLOG.md b/DEVLOG.md
@@ -181,6 +181,32 @@ Decisions outgrow this file, split them into `DECISIONS.md` (or `docs/adr/`).
 
 ---
 
+## 2026-06-03 - Phase 4 eval-summary backbone (PR-A)
+
+### Result - one summary aggregated from the per-phase artifacts; report numbers never hand-copied
+
+- **What landed:** `src/phase4_summary.py` (pure summarizers + layout aggregation + markdown
+  render, no file/Drive/gradio IO), `scripts/build_phase4_summary.py` (reads the five metrics
+  JSONs + three layout CSVs, writes `outputs/evaluation/phase4_summary.json` (gitignored) and
+  `reports/phase4_metrics.md` (committed)), `tests/test_phase4_summary.py` (10 synthetic tests).
+  See `docs/phase4_brief.md`.
+- **Phase 2 has no metrics JSON**, so the builder aggregates it inline from the staged
+  `diagnostic_pos.csv` / `diagnostic_neg.csv` / `smoke_structure.csv`, using the same table-level
+  matching + FP definitions as `scripts/eval_layout_iou.py` and the same OK/WARN split as
+  `scripts/smoke_structure.py`. This reproduced the prior DEVLOG layout numbers **exactly** (mean
+  crop IoU 0.900; matched@0.50 0.900/0.916; matched@0.75 0.880/0.895; crop->TATR 285/286 = 0.997),
+  confirming the inline path needs no Colab re-run.
+- **No-drift gate:** `render_metrics_markdown` is pure and deterministic and the file is written
+  with LF; rebuilding leaves `reports/phase4_metrics.md` byte-identical, so the committed report
+  snippet cannot silently drift from the artifacts.
+- **Reporting choices:** retrieval reports hit@{1,5,10} + MRR@10 only (recall@k == hit@k under one
+  relevant chunk per question, `src/eval_retrieval.py`); a missing artifact degrades to
+  `{"available": false}` rather than failing.
+- **Result:** full `pytest` green (246, +10). Headline echoes: FUNSD `test_50.qa_links` F1 0.727;
+  QA `gt_markdown` answer_exact 0.675. PR-B (report) and PR-C (Gradio demo) follow.
+
+---
+
 ## 2026-06-03 - Phase 3 FUNSD relation-linking baseline (V1)
 
 ### Result - annotation-only spatial heuristic; high precision, recall is the design ceiling
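
The no-drift gate described in the DEVLOG entry above can be sketched as follows. This is a minimal illustration, not the repo's code: `write_lf`, `rebuild_is_byte_identical`, and `render_demo` are hypothetical names, and `render_demo` merely stands in for a pure, deterministic renderer such as `render_metrics_markdown`.

```python
import tempfile
from pathlib import Path


def write_lf(path: Path, text: str) -> None:
    # newline="" disables newline translation, so "\n" is written as LF on every OS.
    with path.open("w", encoding="utf-8", newline="") as f:
        f.write(text)


def rebuild_is_byte_identical(path: Path, render) -> bool:
    """Re-render into `path` and report whether the bytes on disk changed."""
    before = path.read_bytes()
    write_lf(path, render())
    return path.read_bytes() == before


def render_demo() -> str:
    # Hypothetical stand-in renderer: sorted keys + fixed float format = deterministic.
    rows = {"row count accuracy": 0.790, "col count accuracy": 0.987}
    lines = ["| metric | value |", "|---|---|"]
    lines += [f"| {name} | {value:.3f} |" for name, value in sorted(rows.items())]
    return "\n".join(lines) + "\n"


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "metrics.md"
    write_lf(report, render_demo())          # first build
    NO_DRIFT = rebuild_is_byte_identical(report, render_demo)  # second build
    HAS_CRLF = b"\r\n" in report.read_bytes()
```

The gate only holds if the renderer has no hidden inputs (timestamps, unordered iteration, locale-dependent formatting); the sketch keeps the output stable by sorting keys and fixing the float format.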

diff --git a/PLAN.md b/PLAN.md
@@ -544,7 +544,16 @@ Implementation details:
 
 **Phases 0 through 3 are complete and merged** (v1 = table-only RAG; Phase 2 = DocLayNet
 layout-crop integration; Phase 3 = FUNSD relation baseline, both merged to `main`
-2026-06-03). **Phase 4 (full demo + evaluation + report) is the next phase.**
+2026-06-03). **Phase 4 (full demo + evaluation + report) is in progress** on
+`feature/phase4-demo-eval-report`; PR-A (the eval-summary backbone) has landed.
+
+Phase 4 PR-A delivered (capstone summary backbone; see `docs/phase4_brief.md`):
+`src/phase4_summary.py` (pure per-phase summarizers + inline layout-CSV aggregation + markdown
+render), `scripts/build_phase4_summary.py` (writes `outputs/evaluation/phase4_summary.json` and
+the committed `reports/phase4_metrics.md`), `tests/test_phase4_summary.py` (10 synthetic tests).
+Report numbers are generated from the summary (never hand-copied), guarded by a no-drift gate.
+Next: PR-B (`reports/final_report.md` + `notebooks/07_final_report.ipynb`) and PR-C
+(`scripts/run_demo.py` + `notebooks/06_demo.ipynb`, key-optional Gradio demo).
 
 Phase 3 V1 delivered (annotation-only deterministic relation baseline; see
 `docs/phase3_brief.md`): `src/funsd_extraction.py` (parse + dedupe + per-answer-argmax

diff --git a/README.md b/README.md
@@ -47,16 +47,17 @@ pytest
 
 ## Status
 
-Phases 0 through 1C are complete; the v1 release (table-only RAG) is merged to `main`.
-Delivered: the repo foundation; Phase 1A table topology (TATR grid derivation,
-spanning-cell mapping, grid validation, occupancy-aware HTML parsing); Phase 1B OCR
-content extraction (word-to-cell assignment, financial number normalization, content
-metrics); and Phase 1C table-only RAG (BM25 + dense BGE cosine + RRF retrieval, one
-chunk per table, single-provider grounded answer generation, GT-filled vs OCR-filled
-corpora scored separately).
-
-Current branch: Phase 2 (DocLayNet layout integration) is the active follow-up:
-page-level region detection -> table crop -> the existing Phase 1A/1B pipeline.
-The layout-crop MVP gate is implemented and scored on fixed DocLayNet subsets; the
-remaining close-out is the full crop->TATR structure smoke rerun after the tightened
-empty-grid validator. See [PLAN.md](PLAN.md) for the phase roadmap.
+**Phases 0 through 3 are complete and merged to `main`.** Delivered: the repo foundation;
+Phase 1A table topology (TATR grid derivation, spanning-cell mapping, grid validation,
+occupancy-aware HTML parsing); Phase 1B OCR content extraction (word-to-cell assignment,
+financial number normalization, content metrics); Phase 1C table-only RAG (BM25 + dense
+BGE cosine + RRF retrieval, one chunk per table, single-provider grounded answer
+generation, GT-filled vs OCR-filled corpora scored separately); Phase 2 DocLayNet
+layout-crop integration (page-level region detection -> table crop -> the Phase 1A/1B
+pipeline); and Phase 3 FUNSD relation-linking baseline (annotation-only deterministic
+predictor, held-out `test_50.qa_links` F1 0.727).
+
+Current phase: Phase 4 (full demo + evaluation + report) is in progress on
+`feature/phase4-demo-eval-report` — a capstone that aggregates the per-phase metrics into
+one summary, a key-optional Gradio demo, and a written report. See [PLAN.md](PLAN.md) for
+the phase roadmap.
diff --git a/docs/phase4_brief.md b/docs/phase4_brief.md
@@ -0,0 +1,104 @@
+# Phase 4 — Demo + Eval Summary + Final Report (capstone)
+
+> Implementation brief for Phase 4. Committed in the repo (travels with `git pull` to Colab) so
+> the references to it in `DEVLOG.md` and the `src/phase4_summary.py` /
+> `scripts/build_phase4_summary.py` docstrings resolve. Status: PR-A (the eval-summary backbone)
+> implemented on `feature/phase4-demo-eval-report` — `src/phase4_summary.py`,
+> `scripts/build_phase4_summary.py`, `tests/test_phase4_summary.py`, and the generated
+> `reports/phase4_metrics.md`. PR-B (report) and PR-C (demo) follow.
+
+## Context
+
+Phases 0-3 are merged to `main` (FinTabNet.c table topology + OCR content + table-only RAG +
+DocLayNet layout + FUNSD relations). Phase 4 is the **capstone**: make the work presentable,
+reportable, and reproducible. It is explicitly **not new research** — it assembles the existing
+deterministic/custom metrics into one summary, a Gradio demo, and a written report.
+GriTS/Ragas/DeepEval are future work.
+
+All Drive evaluation artifacts are staged locally under `outputs/` (gitignored): metrics JSONs,
+layout CSVs, the RAG chunk corpus, QA sets, table outputs, crops/regions. FUNSD raw is at
+`data/raw/funsd/`.
+
+## Locked decisions
+
+- **Assemble, don't research.** GriTS / Ragas / DeepEval = future work, never a Phase 4 gate.
+- **Report is the product; notebooks are runners** (P1/P2): aggregation in `.py`; notebooks only
+  pull branch + run a script + display tables/figures.
+- **Demo is artifact-backed**, not live PDF -> layout -> TATR -> OCR -> RAG. The only live piece
+  is a QA box doing retrieval + answer generation over the **existing** chunk corpus.
+- **Notebook numbering 06/07** (contiguous). **Entrypoint `scripts/run_demo.py`** (runners live
+  in `scripts/`; no root `app.py` unless HF Spaces later).
+- **Demo degrades gracefully on two independent axes.** (a) *Retrieval stack:* default to
+  **BM25 retrieval-only** (pure CPU, no model); enable dense + RRF only when the embedding stack
+  is importable (a key-less reviewer may also lack a GPU). (b) *Answer generation:* gated solely
+  by `OPENROUTER_API_KEY` (disabled tab + key-missing message when absent). The demo must fully
+  launch with **neither**.
+- **Report metrics generated from the summary, never hand-copied.** The builder emits
+  `phase4_summary.json` and a paste-ready markdown table; report numbers read from the table.
+- **Commit policy:** `reports/phase4_metrics.md` is committed (generated report snippet);
+  `outputs/evaluation/phase4_summary.json` stays gitignored under `outputs/`. The no-drift gate
+  checks `reports/phase4_metrics.md` is byte-identical after a rebuild (the builder writes LF).
+- **Retrieval reported as hit@1 / hit@5 / hit@10 + MRR@10 only.** With one relevant chunk per
+  question `recall@k == hit@k` (`src/eval_retrieval.py`), so recall@k is dropped from the report.
+
+## Input artifacts (all verified present)
+
+| Source | File | Headline keys |
+|---|---|---|
+| 1A topology | `outputs/evaluation/phase1a_topology_<run-id>.json` | row/col_count_accuracy, cell_occupancy_f1, spanning_cell_detection_rate (n=300) |
+| 1B content | `outputs/evaluation/phase1b_content_<run-id>.json` | `aggregate` / `one_to_one` / `topology_matched_subset` cell metrics |
+| 1C retrieval | `outputs/evaluation/rag/phase1c_retrieval.json` | corpora x {bm25,dense,rrf} x hit@{1,5,10}, mrr@10 |
+| 1C QA | `outputs/evaluation/rag/phase1c_qa.json` | configs x {answer_exact, numeric_relaxed, citation_hit, abstain_rate} — GT vs OCR |
+| 2 layout | `outputs/layout/diagnostic_pos.csv` + `diagnostic_neg.csv` + `smoke_structure.csv` | mean crop IoU, matched@0.50/0.75, table-free FP rate, crop->TATR OK rate |
+| 3 FUNSD | `outputs/evaluation/phase3_funsd_relations.json` | `primary`="test_50.qa_links"; results[split][scope] P/R/F1 |
+
+Default deliverable run-id is `mvp_rand` (Phase 1A/1B). **Phase 2 has no JSON**; the builder
+aggregates it inline from the staged CSVs (no Colab re-run), using the same table-level matching +
+FP definitions printed by `scripts/eval_layout_iou.py` (and `scripts/smoke_structure.py` for the
+crop->TATR OK/WARN split). The inline aggregation reproduces the DEVLOG layout numbers exactly
+(mean crop IoU 0.900; matched@0.50 0.900/0.916; matched@0.75 0.880/0.895; crop->TATR 285/286).
+
+## Files
+
+- `src/phase4_summary.py` (new) — pure helpers, no file/Drive/gradio IO: `summarize_topology` /
+  `_content` / `_retrieval` (drops recall@k) / `_qa` / `_funsd` (headline from the JSON's own
+  `primary` pointer); `layout_metrics_from_rows(pos, neg, smoke)` (aggregation over parsed CSV
+  rows); `build_summary(parts)` (missing part -> `{"available": false}`);
+  `render_metrics_markdown(summary)` (deterministic paste-ready table). Follows the style of
+  `src/eval_funsd.py`.
+- `scripts/build_phase4_summary.py` (new) — reads the five JSONs + three layout CSVs, calls the
+  pure helpers, writes `outputs/evaluation/phase4_summary.json` (gitignored) +
+  `reports/phase4_metrics.md` (committed, LF). Graceful on a missing artifact.
+- `scripts/run_demo.py` (new, PR-C) — Gradio app; `gradio` imported inside the script only (never
+  from `src/` or tests); BM25 retrieval default, dense/RRF only if the embedding stack imports;
+  answer generation gated by `OPENROUTER_API_KEY`. Reuses `src/retrieval.py`, `src/llm_client.py`.
+  Tabs: Overview, Table QA, Table Extraction, Layout, FUNSD Relations, Limitations.
+- `notebooks/06_demo.ipynb` (PR-C), `notebooks/07_final_report.ipynb` (PR-B) — Colab runners.
+- `reports/final_report.md` (PR-B) — methodology, metrics (generated-from-summary), GT-vs-OCR
+  separation, limitations, future work, "reproduce in this order".
+- `tests/test_phase4_summary.py` (new) — synthetic fixtures only (P3).
+- Docs: `README.md` status refresh (no stale "Phase 2 active"); `DEVLOG.md` entry; `PLAN.md` §7.
+
+## Out of scope (future work)
+GriTS; Ragas / DeepEval; full-document (non-table) chunking; chart/figure extraction;
+cross-encoder reranker / learned query routing; live PDF -> pipeline; HF Spaces deploy.
+
+## Verification / gates
+1. **Unit:** `pytest tests/test_phase4_summary.py` green, then full `pytest` green — synthetic,
+   no Drive/network, no gradio.
+2. **Summary build:** `python scripts/build_phase4_summary.py` writes the JSON + markdown; numbers
+   match the sources (FUNSD `test_50.qa_links` F1 0.727; QA `gt_markdown` answer_exact 0.675;
+   layout matched@0.50 recall 0.900).
+3. **No-drift:** re-running the builder leaves `reports/phase4_metrics.md` byte-identical.
+4. **Demo:** `scripts/run_demo.py` launches in the degraded case (no key, no embedding stack) and
+   the full case.
+5. **Report:** `reports/final_report.md` exists; README has no stale Phase 2 wording.
+
+## Build order (TDD) + PR boundaries
+- **PR-A (core, done):** tests -> `src/phase4_summary.py` -> `scripts/build_phase4_summary.py` ->
+  generated `reports/phase4_metrics.md`; + README/DEVLOG/PLAN docs.
+- **PR-B (report):** `reports/final_report.md` + `notebooks/07_final_report.ipynb`.
+- **PR-C (demo):** `scripts/run_demo.py` + `notebooks/06_demo.ipynb`.
+
+## Branch
+`feature/phase4-demo-eval-report` cut from the latest `origin/main` after `git fetch`.
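
The brief's locked decision to drop recall@k rests on the identity `recall@k == hit@k` when each question has exactly one relevant chunk. The sketch below illustrates that identity on toy data. The function names, chunk ids, and queries are made up for illustration and are not taken from `src/eval_retrieval.py`; ranked lists are assumed to contain no duplicates.

```python
def hit_at_k(ranked, relevant, k):
    # 1.0 if any relevant chunk appears in the top k, else 0.0.
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    # Fraction of the relevant chunks that appear in the top k.
    found = sum(1 for doc in ranked[:k] if doc in relevant)
    return found / len(relevant)


def reciprocal_rank(ranked, relevant, k):
    # 1/rank of the first relevant chunk within the top k, else 0.0.
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Toy queries: (ranked chunk ids, set holding the single relevant chunk id).
queries = [
    (["t3", "t1", "t7"], {"t1"}),  # relevant chunk at rank 2
    (["t2", "t9", "t4"], {"t2"}),  # relevant chunk at rank 1
    (["t5", "t6", "t8"], {"t0"}),  # relevant chunk not retrieved
]

# With one relevant chunk, recall@k is either 0/1 or 1/1, which is exactly hit@k.
ALL_EQUAL = all(
    hit_at_k(ranked, rel, k) == recall_at_k(ranked, rel, k)
    for ranked, rel in queries
    for k in (1, 2, 3)
)

# Mean reciprocal rank over the toy queries: (1/2 + 1/1 + 0) / 3.
MRR_AT_3 = sum(reciprocal_rank(r, rel, 3) for r, rel in queries) / len(queries)
```

The identity breaks as soon as a question has two or more relevant chunks, since recall@k can then take fractional values while hit@k stays binary.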
diff --git a/reports/phase4_metrics.md b/reports/phase4_metrics.md
@@ -0,0 +1,69 @@
+<!-- generated by scripts/build_phase4_summary.py - do not edit by hand -->
+
+# Phase 4 metrics summary
+
+Generated from `outputs/evaluation/phase4_summary.json`; do not edit by hand.
+
+## Table extraction (Phase 1A topology, Phase 1B content)
+| topology metric | value (n=300) |
+|---|---|
+| row count accuracy | 0.790 |
+| col count accuracy | 0.987 |
+| cell occupancy F1 | 0.977 |
+| spanning cell detection | 0.957 |
+
+| content (cell-level) | exact | numeric | non-empty F1 |
+|---|---|---|---|
+| aggregate (n=300) | 0.804 | 0.876 | 0.977 |
+| one-to-one (IoU>=0.5) | 0.761 | 0.826 | 0.906 |
+| topology-matched (n=234) | 0.819 | 0.902 | 0.988 |
+
+mean alignment IoU (one-to-one): 0.877
+
+## Layout (Phase 2 DocLayNet crop)
+| layout metric | value |
+|---|---|
+| mean crop IoU (GT-table pages) | 0.900 |
+| matched@0.50 (recall / precision) | 0.900 / 0.916 |
+| matched@0.75 (recall / precision) | 0.880 / 0.895 |
+| table-free crop FP rate | 0.065 |
+| crop -> TATR OK rate | 0.997 (285/286) |
+
+## Retrieval (Phase 1C, table chunks)
+| corpus (n=30) | method | hit@1 | hit@5 | hit@10 | MRR@10 |
+|---|---|---|---|---|---|
+| gt_linearized | bm25 | 0.933 | 1.000 | 1.000 | 0.958 |
+| gt_linearized | dense | 0.667 | 0.900 | 0.933 | 0.749 |
+| gt_linearized | rrf | 0.833 | 0.933 | 1.000 | 0.892 |
+| gt_markdown | bm25 | 0.900 | 0.933 | 0.967 | 0.917 |
+| gt_markdown | dense | 0.633 | 0.800 | 0.833 | 0.699 |
+| gt_markdown | rrf | 0.800 | 0.900 | 0.967 | 0.839 |
+| ocr_linearized | bm25 | 0.933 | 1.000 | 1.000 | 0.958 |
+| ocr_linearized | dense | 0.767 | 0.933 | 0.933 | 0.816 |
+| ocr_linearized | rrf | 0.867 | 0.967 | 1.000 | 0.910 |
+| ocr_markdown | bm25 | 0.900 | 0.933 | 0.967 | 0.917 |
+| ocr_markdown | dense | 0.667 | 0.733 | 0.867 | 0.712 |
+| ocr_markdown | rrf | 0.733 | 0.933 | 0.933 | 0.807 |
+
+## Table QA (Phase 1C, answer generation)
+| config (n=46) | answer exact | numeric relaxed | citation hit | abstain rate |
+|---|---|---|---|---|
+| gt_linearized | 0.650 | 0.875 | 0.825 | 0.000 |
+| gt_markdown | 0.675 | 0.775 | 0.800 | 0.025 |
+| ocr_linearized | 0.575 | 0.800 | 0.850 | 0.050 |
+| ocr_markdown | 0.550 | 0.700 | 0.750 | 0.050 |
+
+## FUNSD relations (Phase 3)
+headline (test_50.qa_links): P 0.946 / R 0.590 / F1 0.727
+
+| split | scope | precision | recall | f1 |
+|---|---|---|---|---|
+| all_199 | all_links | 0.925 | 0.401 | 0.560 |
+| all_199 | qa_links | 0.925 | 0.535 | 0.678 |
+| debug_20 | all_links | 0.944 | 0.293 | 0.447 |
+| debug_20 | qa_links | 0.944 | 0.363 | 0.524 |
+| test_50 | all_links | 0.946 | 0.464 | 0.623 |
+| test_50 | qa_links | 0.946 | 0.590 | 0.727 |
+| train_149 | all_links | 0.919 | 0.385 | 0.543 |
+| train_149 | qa_links | 0.919 | 0.521 | 0.665 |
+
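
As a sanity check on the FUNSD table above, each F1 should be the harmonic mean of its row's precision and recall. The sketch below recomputes F1 from the rounded three-decimal values shown in the table, so agreement is only expected to within rounding; the tolerance of 0.001 is an assumption chosen for that reason.

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (split, scope, precision, recall, reported F1), copied from the table above.
rows = [
    ("all_199", "all_links", 0.925, 0.401, 0.560),
    ("all_199", "qa_links", 0.925, 0.535, 0.678),
    ("debug_20", "all_links", 0.944, 0.293, 0.447),
    ("debug_20", "qa_links", 0.944, 0.363, 0.524),
    ("test_50", "all_links", 0.946, 0.464, 0.623),
    ("test_50", "qa_links", 0.946, 0.590, 0.727),
    ("train_149", "all_links", 0.919, 0.385, 0.543),
    ("train_149", "qa_links", 0.919, 0.521, 0.665),
]

# Largest gap between the recomputed and the reported F1 across all rows.
MAX_ABS_DIFF = max(abs(f1(p, r) - reported) for _, _, p, r, reported in rows)

for split, scope, p, r, reported in rows:
    print(f"{split:<10} {scope:<10} recomputed {f1(p, r):.3f}  reported {reported:.3f}")
```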
diff --git a/scripts/build_phase4_summary.py b/scripts/build_phase4_summary.py
@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+"""Build the Phase 4 capstone summary from the per-phase evaluation artifacts.
+
+Reads the five metrics JSONs + the three Phase 2 layout CSVs from outputs/, aggregates them with
+the pure helpers in src/phase4_summary.py, and writes:
+  - outputs/evaluation/phase4_summary.json   (gitignored machine artifact)
+  - reports/phase4_metrics.md                (committed, paste-ready; the report reads these)
+A missing artifact degrades gracefully (its section is marked unavailable). Layout has no JSON, so
+it is aggregated inline from diagnostic_pos.csv / diagnostic_neg.csv / smoke_structure.csv. The
+markdown is written with LF newlines so the no-drift gate holds across Windows and Colab. See
+docs/phase4_brief.md.
+
+Usage:
+    python scripts/build_phase4_summary.py [--run-id mvp_rand]
+"""
+from __future__ import annotations
+
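+import argparse
+import csv
+import json
+import sys
+from pathlib import Path
+
+# Make the repo root importable so `src` resolves when this file is run as a script.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from src import config  # noqa: E402
+from src import phase4_summary as p4  # noqa: E402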
+
+
+def _load_json(path: Path):
+    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else None
+
+
+def _load_csv(path: Path):
+    if not path.exists():
+        return None
+    with path.open(newline="", encoding="utf-8") as f:
+        return list(csv.DictReader(f))
+
+
+def _layout_part(layout_dir: Path):
+    """Aggregate the three staged layout CSVs; None unless all are present."""
+    pos = _load_csv(layout_dir / "diagnostic_pos.csv")
+    neg = _load_csv(layout_dir / "diagnostic_neg.csv")
+    smoke = _load_csv(layout_dir / "smoke_structure.csv")
+    if pos is None or neg is None or smoke is None:
+        return None
+    return p4.layout_metrics_from_rows(pos, neg, smoke)
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser(description="Build the Phase 4 capstone summary.")
+    ap.add_argument("--run-id", default="mvp_rand",
+                    help="run-id suffix of the Phase 1A/1B deliverable artifacts")
+    args = ap.parse_args()
+
+    ev = config.EVALUATION
+    rag = ev / "rag"
+    topo = _load_json(ev / f"phase1a_topology_{args.run_id}.json")
+    content = _load_json(ev / f"phase1b_content_{args.run_id}.json")
+    retr = _load_json(rag / "phase1c_retrieval.json")
+    qa = _load_json(rag / "phase1c_qa.json")
+    funsd = _load_json(ev / "phase3_funsd_relations.json")
+
+    parts = {
+        "topology": p4.summarize_topology(topo) if topo else None,
+        "content": p4.summarize_content(content) if content else None,
+        "retrieval": p4.summarize_retrieval(retr) if retr else None,
+        "qa": p4.summarize_qa(qa) if qa else None,
+        "layout": _layout_part(config.LAYOUT_OUTPUT),
+        "funsd": p4.summarize_funsd(funsd) if funsd else None,
+    }
+    summary = p4.build_summary(parts)
+
+    summary_path = ev / "phase4_summary.json"
+    summary_path.parent.mkdir(parents=True, exist_ok=True)
+    summary_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
+
+    md_path = config.ROOT / "reports" / "phase4_metrics.md"
+    md_path.parent.mkdir(parents=True, exist_ok=True)
+    with md_path.open("w", encoding="utf-8", newline="") as f:   # newline="": LF verbatim
+        f.write(p4.render_metrics_markdown(summary))
+
+    print("Phase 4 summary - artifact availability:")
+    for name in p4.PHASES:
+        print(f"  {name:<10} {'OK' if summary[name].get('available') else 'MISSING'}")
+    if summary["funsd"].get("available"):
+        h = summary["funsd"]["headline"]
+        print(f"\nFUNSD headline ({summary['funsd']['primary']}): "
+              f"P {h['precision']:.3f} / R {h['recall']:.3f} / F1 {h['f1']:.3f}")
+    print(f"\nsummary -> {summary_path}")
+    print(f"metrics  -> {md_path}")
+
+
+if __name__ == "__main__":
+    main()
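
The script passes `None` for any part whose artifact is missing and relies on `build_summary` to mark that section unavailable. One way this contract could look is sketched below. It is an assumption-laden illustration, not the contents of `src/phase4_summary.py`: `build_summary_sketch` is a hypothetical name, and merging `"available": True` into each present part is a guess based on how the script reads `summary[name].get("available")`.

```python
import json

# Assumed section order, mirroring the keys of `parts` in the script above.
PHASES = ("topology", "content", "retrieval", "qa", "layout", "funsd")


def build_summary_sketch(parts: dict) -> dict:
    """Return one entry per phase; a missing or None part degrades to unavailable."""
    summary = {}
    for name in PHASES:
        part = parts.get(name)
        if part is None:
            # Missing artifact: keep the section, flag it, and do not raise.
            summary[name] = {"available": False}
        else:
            summary[name] = {"available": True, **part}
    return summary


demo = build_summary_sketch({
    "funsd": {"primary": "test_50.qa_links",
              "headline": {"precision": 0.946, "recall": 0.590, "f1": 0.727}},
    "layout": None,  # e.g. one of the three layout CSVs was absent
})
print(json.dumps(demo["layout"]))
print(demo["funsd"]["headline"]["f1"])
```

Keeping every phase key present, even when unavailable, lets downstream code such as the availability printout iterate over a fixed list without key checks.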