Merged
26 changes: 26 additions & 0 deletions DEVLOG.md
@@ -181,6 +181,32 @@ Decisions outgrow this file, split them into `DECISIONS.md` (or `docs/adr/`).

---

## 2026-06-03 - Phase 4 eval-summary backbone (PR-A)

### Result - one summary aggregated from the per-phase artifacts; report numbers never hand-copied

- **What landed:** `src/phase4_summary.py` (pure summarizers + layout aggregation + markdown
render, no file/Drive/gradio IO), `scripts/build_phase4_summary.py` (reads the five metrics
JSONs + three layout CSVs, writes `outputs/evaluation/phase4_summary.json` gitignored +
`reports/phase4_metrics.md` committed), `tests/test_phase4_summary.py` (10 synthetic tests).
See `docs/phase4_brief.md`.
- **Phase 2 has no metrics JSON**, so the builder aggregates it inline from the staged
`diagnostic_pos.csv` / `diagnostic_neg.csv` / `smoke_structure.csv`, matching the table-level
matching + FP definitions in `scripts/eval_layout_iou.py` and the OK/WARN split in
`scripts/smoke_structure.py`. This reproduced the prior DEVLOG layout numbers **exactly** (mean
crop IoU 0.900; matched@0.50 0.900/0.916; matched@0.75 0.880/0.895; crop->TATR 285/286 = 0.997),
confirming the inline path needs no Colab re-run.
- **No-drift gate:** `render_metrics_markdown` is pure and deterministic and the file is written
with LF; rebuilding leaves `reports/phase4_metrics.md` byte-identical, so the committed report
snippet cannot silently drift from the artifacts.
- **Reporting choices:** retrieval reports hit@{1,5,10} + MRR@10 only (recall@k == hit@k under one
relevant chunk per question, `src/eval_retrieval.py`); a missing artifact degrades to
`{"available": false}` rather than failing.
- **Result:** full `pytest` green (246, +10). Headline echoes: FUNSD `test_50.qa_links` F1 0.727;
QA `gt_markdown` answer_exact 0.675. PR-B (report) and PR-C (Gradio demo) follow.
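
The reporting choice above rests on `recall@k == hit@k` when each question has exactly one relevant chunk. The sketch below illustrates that identity on toy data; the function names and chunk ids are hypothetical and are not taken from `src/eval_retrieval.py`.

```python
from __future__ import annotations

from typing import Sequence


def hit_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant chunk appears in the top-k, else 0.0."""
    return 1.0 if any(c in relevant for c in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """1/rank of the first relevant chunk within the top-k, else 0.0."""
    for rank, chunk in enumerate(ranked[:k], start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


# Toy ranking (hypothetical chunk ids).
ranked = ["t7", "t3", "t9", "t1"]
single = {"t9"}          # one relevant chunk: hit@k and recall@k coincide
multi = {"t9", "t42"}    # two relevant chunks: they can differ

for k in (1, 2, 3, 4):
    print(k, hit_at_k(ranked, single, k), recall_at_k(ranked, single, k))
```

With `multi`, hit@3 is 1.0 while recall@3 is 0.5, which is why the identity only holds under the one-relevant-chunk assumption.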

---

## 2026-06-03 - Phase 3 FUNSD relation-linking baseline (V1)

### Result - annotation-only spatial heuristic; high precision, recall is the design ceiling
11 changes: 10 additions & 1 deletion PLAN.md
@@ -544,7 +544,16 @@ Implementation details:

**Phases 0 through 3 are complete and merged** (v1 = table-only RAG; Phase 2 = DocLayNet
layout-crop integration; Phase 3 = FUNSD relation baseline, both merged to `main`
-2026-06-03). **Phase 4 (full demo + evaluation + report) is the next phase.**
+2026-06-03). **Phase 4 (full demo + evaluation + report) is in progress** on
+`feature/phase4-demo-eval-report`; PR-A (the eval-summary backbone) has landed.
+
+Phase 4 PR-A delivered (capstone summary backbone; see `docs/phase4_brief.md`):
+`src/phase4_summary.py` (pure per-phase summarizers + inline layout-CSV aggregation + markdown
+render), `scripts/build_phase4_summary.py` (writes `outputs/evaluation/phase4_summary.json` and
+the committed `reports/phase4_metrics.md`), `tests/test_phase4_summary.py` (10 synthetic tests).
+Report numbers are generated from the summary (never hand-copied), guarded by a no-drift gate.
+Next: PR-B (`reports/final_report.md` + `notebooks/07_final_report.ipynb`) and PR-C
+(`scripts/run_demo.py` + `notebooks/06_demo.ipynb`, key-optional Gradio demo).

Phase 3 V1 delivered (annotation-only deterministic relation baseline; see
`docs/phase3_brief.md`): `src/funsd_extraction.py` (parse + dedupe + per-answer-argmax
27 changes: 14 additions & 13 deletions README.md
@@ -47,16 +47,17 @@ pytest

## Status

-Phases 0 through 1C are complete; the v1 release (table-only RAG) is merged to `main`.
-Delivered: the repo foundation; Phase 1A table topology (TATR grid derivation,
-spanning-cell mapping, grid validation, occupancy-aware HTML parsing); Phase 1B OCR
-content extraction (word-to-cell assignment, financial number normalization, content
-metrics); and Phase 1C table-only RAG (BM25 + dense BGE cosine + RRF retrieval, one
-chunk per table, single-provider grounded answer generation, GT-filled vs OCR-filled
-corpora scored separately).
-
-Current branch: Phase 2 (DocLayNet layout integration) is the active follow-up:
-page-level region detection -> table crop -> the existing Phase 1A/1B pipeline.
-The layout-crop MVP gate is implemented and scored on fixed DocLayNet subsets; the
-remaining close-out is the full crop->TATR structure smoke rerun after the tightened
-empty-grid validator. See [PLAN.md](PLAN.md) for the phase roadmap.
+**Phases 0 through 3 are complete and merged to `main`.** Delivered: the repo foundation;
+Phase 1A table topology (TATR grid derivation, spanning-cell mapping, grid validation,
+occupancy-aware HTML parsing); Phase 1B OCR content extraction (word-to-cell assignment,
+financial number normalization, content metrics); Phase 1C table-only RAG (BM25 + dense
+BGE cosine + RRF retrieval, one chunk per table, single-provider grounded answer
+generation, GT-filled vs OCR-filled corpora scored separately); Phase 2 DocLayNet
+layout-crop integration (page-level region detection -> table crop -> the Phase 1A/1B
+pipeline); and Phase 3 FUNSD relation-linking baseline (annotation-only deterministic
+predictor, held-out `test_50.qa_links` F1 0.727).
+
+Current phase: Phase 4 (full demo + evaluation + report) is in progress on
+`feature/phase4-demo-eval-report` — a capstone that aggregates the per-phase metrics into
+one summary, a key-optional Gradio demo, and a written report. See [PLAN.md](PLAN.md) for
+the phase roadmap.
104 changes: 104 additions & 0 deletions docs/phase4_brief.md
@@ -0,0 +1,104 @@
# Phase 4 — Demo + Eval Summary + Final Report (capstone)

> Implementation brief for Phase 4. Committed in the repo (travels with `git pull` to Colab) so
> the references to it in `DEVLOG.md` and the `src/phase4_summary.py` /
> `scripts/build_phase4_summary.py` docstrings resolve. Status: PR-A (the eval-summary backbone)
> implemented on `feature/phase4-demo-eval-report` — `src/phase4_summary.py`,
> `scripts/build_phase4_summary.py`, `tests/test_phase4_summary.py`, and the generated
> `reports/phase4_metrics.md`. PR-B (report) and PR-C (demo) follow.

## Context

Phases 0-3 are merged to `main` (FinTabNet.c table topology + OCR content + table-only RAG +
DocLayNet layout + FUNSD relations). Phase 4 is the **capstone**: make the work presentable,
reportable, and reproducible. It is explicitly **not new research** — it assembles the existing
deterministic/custom metrics into one summary, a Gradio demo, and a written report.
GriTS/Ragas/DeepEval are future work.

All Drive evaluation artifacts are staged locally under `outputs/` (gitignored): metrics JSONs,
layout CSVs, the RAG chunk corpus, QA sets, table outputs, crops/regions. FUNSD raw is at
`data/raw/funsd/`.

## Locked decisions

- **Assemble, don't research.** GriTS / Ragas / DeepEval = future work, never a Phase 4 gate.
- **Report is the product; notebooks are runners** (P1/P2): aggregation in `.py`; notebooks only
pull branch + run a script + display tables/figures.
- **Demo is artifact-backed**, not live PDF -> layout -> TATR -> OCR -> RAG. The only live piece
is a QA box doing retrieval + answer generation over the **existing** chunk corpus.
- **Notebook numbering 06/07** (contiguous). **Entrypoint `scripts/run_demo.py`** (runners live
in `scripts/`; no root `app.py` unless HF Spaces later).
- **Demo degrades gracefully on two independent axes.** (a) *Retrieval stack:* default to
**BM25 retrieval-only** (pure CPU, no model); enable dense + RRF only when the embedding stack
is importable (a key-less reviewer may also lack a GPU). (b) *Answer generation:* gated solely
by `OPENROUTER_API_KEY` (disabled tab + key-missing message when absent). The demo must fully
launch with **neither**.
- **Report metrics generated from the summary, never hand-copied.** The builder emits
`phase4_summary.json` and a paste-ready markdown table; report numbers read from the table.
- **Commit policy:** `reports/phase4_metrics.md` is committed (generated report snippet);
`outputs/evaluation/phase4_summary.json` stays gitignored under `outputs/`. The no-drift gate
checks `reports/phase4_metrics.md` is byte-identical after a rebuild (the builder writes LF).
- **Retrieval reported as hit@1 / hit@5 / hit@10 + MRR@10 only.** With one relevant chunk per
question `recall@k == hit@k` (`src/eval_retrieval.py`), so recall@k is dropped from the report.
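
The two independent degradation axes for the demo can be probed before any UI is built. This is a minimal sketch under stated assumptions, not the code in `scripts/run_demo.py`: `OPENROUTER_API_KEY` comes from this brief, while `demo_capabilities`, `embedding_stack_available`, and the `sentence_transformers` module name are hypothetical stand-ins for whatever the embedding stack actually imports.

```python
from __future__ import annotations

import importlib.util
import os
from typing import Mapping, Sequence


def embedding_stack_available(modules: Sequence[str]) -> bool:
    """True only if every top-level module can be located.

    find_spec locates a module without importing it, so this is a cheap
    probe; a real check might still need a guarded import.
    """
    return all(importlib.util.find_spec(m) is not None for m in modules)


def demo_capabilities(
    env: Mapping[str, str] | None = None,
    modules: Sequence[str] = ("sentence_transformers",),  # assumed module name
) -> dict:
    """Resolve both axes independently; neither one blocks the launch."""
    env = os.environ if env is None else env
    dense_ok = embedding_stack_available(modules)
    return {
        # Axis (a): BM25 is always available, dense + RRF are opt-in.
        "retrieval": ["bm25", "dense", "rrf"] if dense_ok else ["bm25"],
        # Axis (b): answer generation is gated solely by the API key.
        "answer_generation": bool(env.get("OPENROUTER_API_KEY")),
    }


print(demo_capabilities())
```

Passing `env` and `modules` explicitly keeps the probe testable without touching the real environment.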

## Input artifacts (all verified present)

| Source | File | Headline keys |
|---|---|---|
| 1A topology | `outputs/evaluation/phase1a_topology_<run-id>.json` | row/col_count_accuracy, cell_occupancy_f1, spanning_cell_detection_rate (n=300) |
| 1B content | `outputs/evaluation/phase1b_content_<run-id>.json` | `aggregate` / `one_to_one` / `topology_matched_subset` cell metrics |
| 1C retrieval | `outputs/evaluation/rag/phase1c_retrieval.json` | corpora x {bm25,dense,rrf} x hit@{1,5,10}, mrr@10 |
| 1C QA | `outputs/evaluation/rag/phase1c_qa.json` | configs x {answer_exact, numeric_relaxed, citation_hit, abstain_rate} — GT vs OCR |
| 2 layout | `outputs/layout/diagnostic_pos.csv` + `diagnostic_neg.csv` + `smoke_structure.csv` | mean crop IoU, matched@0.50/0.75, table-free FP rate, crop->TATR OK rate |
| 3 FUNSD | `outputs/evaluation/phase3_funsd_relations.json` | `primary`="test_50.qa_links"; results[split][scope] P/R/F1 |

Default deliverable run-id is `mvp_rand` (Phase 1A/1B). **Phase 2 has no JSON**; the builder
aggregates it inline from the staged CSVs (no Colab re-run), matching the table-level matching +
FP definitions printed by `scripts/eval_layout_iou.py` (and `scripts/smoke_structure.py` for the
crop->TATR OK/WARN split). The inline aggregation reproduces the DEVLOG layout numbers exactly
(mean crop IoU 0.900; matched@0.50 0.900/0.916; matched@0.75 0.880/0.895; crop->TATR 285/286).
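
As an illustration of the inline aggregation, the crop->TATR OK rate is a simple count over the smoke rows. The sketch below assumes a hypothetical `status` column holding `OK` or `WARN`, and assumes the single non-OK row is a `WARN`; the real column names in `smoke_structure.csv` and the logic in `layout_metrics_from_rows` may differ.

```python
from __future__ import annotations

from typing import Mapping, Sequence


def ok_rate(rows: Sequence[Mapping[str, str]], status_key: str = "status"):
    """Return (ok_count, total, rate); rate is 0.0 for an empty input."""
    total = len(rows)
    ok = sum(1 for row in rows if row.get(status_key) == "OK")
    return ok, total, (ok / total if total else 0.0)


# Synthetic rows shaped like csv.DictReader output (hypothetical schema).
rows = [{"status": "OK"}] * 285 + [{"status": "WARN"}]
ok, total, rate = ok_rate(rows)
cell = f"{rate:.3f} ({ok}/{total})"
print(cell)
```

Formatting to three decimals turns 285/286 into the 0.997 quoted above.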

## Files

- `src/phase4_summary.py` (new) — pure helpers, no file/Drive/gradio IO: `summarize_topology` /
`_content` / `_retrieval` (drops recall@k) / `_qa` / `_funsd` (headline from the JSON's own
`primary` pointer); `layout_metrics_from_rows(pos, neg, smoke)` (aggregation over parsed CSV
rows); `build_summary(parts)` (missing part -> `{"available": false}`);
`render_metrics_markdown(summary)` (deterministic paste-ready table). Style of
`src/eval_funsd.py`.
- `scripts/build_phase4_summary.py` (new) — reads the five JSONs + three layout CSVs, calls the
pure helpers, writes `outputs/evaluation/phase4_summary.json` (gitignored) +
`reports/phase4_metrics.md` (committed, LF). Graceful on a missing artifact.
- `scripts/run_demo.py` (new, PR-C) — Gradio app; `gradio` imported inside the script only (never
from `src/` or tests); BM25 retrieval default, dense/RRF only if the embedding stack imports;
answer generation gated by `OPENROUTER_API_KEY`. Reuses `src/retrieval.py`, `src/llm_client.py`.
Tabs: Overview, Table QA, Table Extraction, Layout, FUNSD Relations, Limitations.
- `notebooks/06_demo.ipynb` (PR-C), `notebooks/07_final_report.ipynb` (PR-B) — Colab runners.
- `reports/final_report.md` (PR-B) — methodology, metrics (generated-from-summary), GT-vs-OCR
separation, limitations, future work, "reproduce in this order".
- `tests/test_phase4_summary.py` (new) — synthetic fixtures only (P3).
- Docs: `README.md` status refresh (no stale "Phase 2 active"); `DEVLOG.md` entry; `PLAN.md` §7.
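
The "missing part -> `{"available": false}`" contract of `build_summary(parts)` might look roughly like the following. It is a sketch only: the `PHASES` ordering and the merge of each part's keys beside the `available` flag are assumptions, not the layout actually used by `src/phase4_summary.py`.

```python
from __future__ import annotations

import json
from typing import Any, Mapping, Optional

# Hypothetical ordering; the real module exposes its own PHASES constant.
PHASES = ("topology", "content", "layout", "retrieval", "qa", "funsd")


def build_summary_sketch(parts: Mapping[str, Optional[Mapping[str, Any]]]) -> dict:
    """Mark every phase available or not; never raise on a missing part."""
    summary: dict[str, dict] = {}
    for name in PHASES:
        part = parts.get(name)
        if part is None:
            summary[name] = {"available": False}
        else:
            summary[name] = {"available": True, **part}
    return summary


summary = build_summary_sketch({"funsd": {"f1": 0.727}, "layout": None})
print(json.dumps(summary["layout"]))  # serializes the flag as JSON false
```

Because every phase key is always present, downstream rendering can branch on `summary[name]["available"]` without guarding against `KeyError`.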

## Out of scope (future work)
GriTS; Ragas / DeepEval; full-document (non-table) chunking; chart/figure extraction;
cross-encoder reranker / learned query routing; live PDF -> pipeline; HF Spaces deploy.

## Verification / gates
1. **Unit:** `pytest tests/test_phase4_summary.py` green, then full `pytest` green — synthetic,
no Drive/network, no gradio.
2. **Summary build:** `python scripts/build_phase4_summary.py` writes the JSON + markdown; numbers
match the sources (FUNSD test_50.qa_links F1 0.727; QA gt_markdown answer_exact 0.675; layout
matched@0.50 recall 0.900).
3. **No-drift:** re-running the builder leaves `reports/phase4_metrics.md` byte-identical.
4. **Demo:** `scripts/run_demo.py` launches in the degraded case (no key, no embedding stack) and
the full case.
5. **Report:** `reports/final_report.md` exists; README has no stale Phase 2 wording.
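
Gate 3 can be checked by comparing the committed bytes against a fresh render. The sketch below assumes a deterministic renderer; `render_stub` is a hypothetical stand-in for `render_metrics_markdown`, and `has_drifted` is not a function from the repo.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def render_stub(summary: dict[str, float]) -> str:
    """Deterministic stand-in renderer: sorted keys, fixed precision, LF."""
    lines = ["| metric | value |", "|---|---|"]
    for key in sorted(summary):
        lines.append(f"| {key} | {summary[key]:.3f} |")
    return "\n".join(lines) + "\n"


def has_drifted(committed: Path, summary: dict[str, float]) -> bool:
    """True if the committed file differs byte-for-byte from a fresh render."""
    fresh = render_stub(summary).encode("utf-8")
    return committed.read_bytes() != fresh


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "phase4_metrics.md"
    report.write_bytes(render_stub({"f1": 0.727}).encode("utf-8"))
    print("same summary drifted:", has_drifted(report, {"f1": 0.727}))
    print("changed summary drifted:", has_drifted(report, {"f1": 0.728}))
```

Comparing raw bytes rather than decoded text means a CRLF rewrite of the committed file also counts as drift.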

## Build order (TDD) + PR boundaries
- **PR-A (core, done):** tests -> `src/phase4_summary.py` -> `scripts/build_phase4_summary.py` ->
generated `reports/phase4_metrics.md`; + README/DEVLOG/PLAN docs.
- **PR-B (report):** `reports/final_report.md` + `notebooks/07_final_report.ipynb`.
- **PR-C (demo):** `scripts/run_demo.py` + `notebooks/06_demo.ipynb`.

## Branch
`feature/phase4-demo-eval-report` cut from the latest `origin/main` after `git fetch`.
69 changes: 69 additions & 0 deletions reports/phase4_metrics.md
@@ -0,0 +1,69 @@
<!-- generated by scripts/build_phase4_summary.py - do not edit by hand -->

# Phase 4 metrics summary

Generated from `outputs/evaluation/phase4_summary.json`; do not edit by hand.

## Table extraction (Phase 1A topology, Phase 1B content)
| topology metric | value (n=300) |
|---|---|
| row count accuracy | 0.790 |
| col count accuracy | 0.987 |
| cell occupancy F1 | 0.977 |
| spanning cell detection | 0.957 |

| content (cell-level) | exact | numeric | non-empty F1 |
|---|---|---|---|
| aggregate (n=300) | 0.804 | 0.876 | 0.977 |
| one-to-one (IoU>=0.5) | 0.761 | 0.826 | 0.906 |
| topology-matched (n=234) | 0.819 | 0.902 | 0.988 |

mean alignment IoU (one-to-one): 0.877

## Layout (Phase 2 DocLayNet crop)
| layout metric | value |
|---|---|
| mean crop IoU (GT-table pages) | 0.900 |
| matched@0.50 (recall / precision) | 0.900 / 0.916 |
| matched@0.75 (recall / precision) | 0.880 / 0.895 |
| table-free crop FP rate | 0.065 |
| crop -> TATR OK rate | 0.997 (285/286) |

## Retrieval (Phase 1C, table chunks)
| corpus (n=30) | method | hit@1 | hit@5 | hit@10 | MRR@10 |
|---|---|---|---|---|---|
| gt_linearized | bm25 | 0.933 | 1.000 | 1.000 | 0.958 |
| gt_linearized | dense | 0.667 | 0.900 | 0.933 | 0.749 |
| gt_linearized | rrf | 0.833 | 0.933 | 1.000 | 0.892 |
| gt_markdown | bm25 | 0.900 | 0.933 | 0.967 | 0.917 |
| gt_markdown | dense | 0.633 | 0.800 | 0.833 | 0.699 |
| gt_markdown | rrf | 0.800 | 0.900 | 0.967 | 0.839 |
| ocr_linearized | bm25 | 0.933 | 1.000 | 1.000 | 0.958 |
| ocr_linearized | dense | 0.767 | 0.933 | 0.933 | 0.816 |
| ocr_linearized | rrf | 0.867 | 0.967 | 1.000 | 0.910 |
| ocr_markdown | bm25 | 0.900 | 0.933 | 0.967 | 0.917 |
| ocr_markdown | dense | 0.667 | 0.733 | 0.867 | 0.712 |
| ocr_markdown | rrf | 0.733 | 0.933 | 0.933 | 0.807 |

## Table QA (Phase 1C, answer generation)
| config (n=46) | answer exact | numeric relaxed | citation hit | abstain rate |
|---|---|---|---|---|
| gt_linearized | 0.650 | 0.875 | 0.825 | 0.000 |
| gt_markdown | 0.675 | 0.775 | 0.800 | 0.025 |
| ocr_linearized | 0.575 | 0.800 | 0.850 | 0.050 |
| ocr_markdown | 0.550 | 0.700 | 0.750 | 0.050 |

## FUNSD relations (Phase 3)
headline (test_50.qa_links): P 0.946 / R 0.590 / F1 0.727

| split | scope | precision | recall | f1 |
|---|---|---|---|---|
| all_199 | all_links | 0.925 | 0.401 | 0.560 |
| all_199 | qa_links | 0.925 | 0.535 | 0.678 |
| debug_20 | all_links | 0.944 | 0.293 | 0.447 |
| debug_20 | qa_links | 0.944 | 0.363 | 0.524 |
| test_50 | all_links | 0.946 | 0.464 | 0.623 |
| test_50 | qa_links | 0.946 | 0.590 | 0.727 |
| train_149 | all_links | 0.919 | 0.385 | 0.543 |
| train_149 | qa_links | 0.919 | 0.521 | 0.665 |

97 changes: 97 additions & 0 deletions scripts/build_phase4_summary.py
@@ -0,0 +1,97 @@
#!/usr/bin/env python3
"""Build the Phase 4 capstone summary from the per-phase evaluation artifacts.

Reads the five metrics JSONs + the three Phase 2 layout CSVs from outputs/, aggregates them with
the pure helpers in src/phase4_summary.py, and writes:
  - outputs/evaluation/phase4_summary.json (gitignored machine artifact)
  - reports/phase4_metrics.md (committed, paste-ready; the report reads these)
A missing artifact degrades gracefully (its section is marked unavailable). Layout has no JSON, so
it is aggregated inline from diagnostic_pos.csv / diagnostic_neg.csv / smoke_structure.csv. The
markdown is written with LF newlines so the no-drift gate holds across Windows and Colab. See
docs/phase4_brief.md.

Usage:
    python scripts/build_phase4_summary.py [--run-id mvp_rand]
"""
from __future__ import annotations

import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

import argparse
import csv
import json

from src import config
from src import phase4_summary as p4


def _load_json(path: Path):
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else None


def _load_csv(path: Path):
    if not path.exists():
        return None
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def _layout_part(layout_dir: Path):
    """Aggregate the three staged layout CSVs; None unless all are present."""
    pos = _load_csv(layout_dir / "diagnostic_pos.csv")
    neg = _load_csv(layout_dir / "diagnostic_neg.csv")
    smoke = _load_csv(layout_dir / "smoke_structure.csv")
    if pos is None or neg is None or smoke is None:
        return None
    return p4.layout_metrics_from_rows(pos, neg, smoke)


def main() -> None:
    ap = argparse.ArgumentParser(description="Build the Phase 4 capstone summary.")
    ap.add_argument("--run-id", default="mvp_rand",
                    help="run-id suffix of the Phase 1A/1B deliverable artifacts")
    args = ap.parse_args()

    ev = config.EVALUATION
    rag = ev / "rag"
    topo = _load_json(ev / f"phase1a_topology_{args.run_id}.json")
    content = _load_json(ev / f"phase1b_content_{args.run_id}.json")
    retr = _load_json(rag / "phase1c_retrieval.json")
    qa = _load_json(rag / "phase1c_qa.json")
    funsd = _load_json(ev / "phase3_funsd_relations.json")

    parts = {
        "topology": p4.summarize_topology(topo) if topo else None,
        "content": p4.summarize_content(content) if content else None,
        "retrieval": p4.summarize_retrieval(retr) if retr else None,
        "qa": p4.summarize_qa(qa) if qa else None,
        "layout": _layout_part(config.LAYOUT_OUTPUT),
        "funsd": p4.summarize_funsd(funsd) if funsd else None,
    }
    summary = p4.build_summary(parts)

    summary_path = ev / "phase4_summary.json"
    summary_path.parent.mkdir(parents=True, exist_ok=True)
    summary_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")

    md_path = config.ROOT / "reports" / "phase4_metrics.md"
    md_path.parent.mkdir(parents=True, exist_ok=True)
    with md_path.open("w", encoding="utf-8", newline="") as f:  # newline="": LF verbatim
        f.write(p4.render_metrics_markdown(summary))

    print("Phase 4 summary - artifact availability:")
    for name in p4.PHASES:
        print(f"  {name:<10} {'OK' if summary[name].get('available') else 'MISSING'}")
    if summary["funsd"].get("available"):
        h = summary["funsd"]["headline"]
        print(f"\nFUNSD headline ({summary['funsd']['primary']}): "
              f"P {h['precision']:.3f} / R {h['recall']:.3f} / F1 {h['f1']:.3f}")
    print(f"\nsummary -> {summary_path}")
    print(f"metrics  -> {md_path}")


if __name__ == "__main__":
    main()
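
The `newline=""` argument in the script is what keeps the report LF-only on every platform: in text write mode, Python performs no newline translation when `newline` is `""` or `"\n"`. A small self-contained check follows; the helper name and file contents are illustrative only.

```python
import tempfile
from pathlib import Path


def write_text_lf(path: Path, text: str) -> None:
    """Write text without newline translation, so '\n' stays a single LF byte."""
    with path.open("w", encoding="utf-8", newline="") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metrics.md"
    write_text_lf(target, "| a | b |\n|---|---|\n")
    raw = target.read_bytes()

print(raw)
```

With the default `newline=None`, the same write would emit `\r\n` on Windows, and a rebuild there would no longer match a file generated on Colab.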