From 5c57bac59345b8066d84562e25daa578e0161c9e Mon Sep 17 00:00:00 2001
From: AD2000X <thecausticfinale@gmail.com>
Date: Wed, 3 Jun 2026 15:58:55 +0100
Subject: [PATCH] feat: Phase 4 PR-B final report + report notebook

- reports/final_report.md: capstone narrative (methodology, results read from the
  generated table, limitations, future work, reproduce-in-order); no hand-copied metrics
- notebooks/07_final_report.ipynb: Colab runner - boot, run build_phase4_summary,
  render final_report.md + generated phase4_metrics.md inline

Report prose reads from the generated reports/phase4_metrics.md; numbers are never hand-copied.
---
 notebooks/07_final_report.ipynb | 181 ++++++++++++++++++++++++++++++++
 reports/final_report.md         | 106 +++++++++++++++++++
 2 files changed, 287 insertions(+)
 create mode 100644 notebooks/07_final_report.ipynb
 create mode 100644 reports/final_report.md

diff --git a/notebooks/07_final_report.ipynb b/notebooks/07_final_report.ipynb
new file mode 100644
index 0000000..d46b53d
--- /dev/null
+++ b/notebooks/07_final_report.ipynb
@@ -0,0 +1,181 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Phase 4 - Final report (Colab runner)\n",
+    "\n",
+    "Runner only: mount Drive, pull the Phase 4 branch, regenerate the capstone summary from the staged\n",
+    "evaluation artifacts, then render the final report and the generated metrics table inline. Logic\n",
+    "lives in `src/` and `scripts/`, not in this notebook (P1/P2).\n",
+    "\n",
+    "The metric numbers shown below are produced by `scripts/build_phase4_summary.py`; the report prose\n",
+    "in `reports/final_report.md` never hand-copies them."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Boot"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 1. Mount Drive so config.OUTPUT_ROOT (the staged evaluation artifacts) is available.\n",
+    "from google.colab import drive\n",
+    "drive.mount('/content/drive')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 2. Get the code onto the VM and pin the Phase 4 branch.\n",
+    "# Repin BRANCH to 'main' once the Phase 4 PRs are merged.\n",
+    "import os\n",
+    "\n",
+    "REPO = '/content/FinDocStructRAG'\n",
+    "BRANCH = 'feature/phase4-report'  # PR-B; flip to 'main' after merge\n",
+    "\n",
+    "if not os.path.isdir(f'{REPO}/.git'):\n",
+    "    !git clone --quiet https://github.com/AD2000X/FinDocStructRAG.git {REPO}\n",
+    "\n",
+    "!cd {REPO} && git fetch origin --quiet\n",
+    "!cd {REPO} && git checkout {BRANCH} && git pull --ff-only origin {BRANCH}\n",
+    "!cd {REPO} && echo branch: $(git rev-parse --abbrev-ref HEAD) HEAD: $(git log --oneline -1)\n",
+    "%cd /content/FinDocStructRAG"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 3. Make src/ importable and sanity-check the Phase 4 paths.\n",
+    "import importlib\n",
+    "import sys\n",
+    "\n",
+    "sys.path.insert(0, '/content/FinDocStructRAG')\n",
+    "from src import config\n",
+    "importlib.reload(config)\n",
+    "\n",
+    "print('IN_COLAB    :', config.IN_COLAB)\n",
+    "print('OUTPUT_ROOT :', config.OUTPUT_ROOT)\n",
+    "print('EVALUATION  :', config.EVALUATION)\n",
+    "print('LAYOUT      :', config.LAYOUT_OUTPUT)\n",
+    "print('reports dir :', config.ROOT / 'reports')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1 - build the capstone summary\n",
+    "\n",
+    "Aggregates the per-phase artifacts on Drive into `outputs/evaluation/phase4_summary.json` and the\n",
+    "committed `reports/phase4_metrics.md`. Re-running is idempotent (no-drift)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python scripts/build_phase4_summary.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2 - final report"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "from IPython.display import Markdown, display\n",
+    "\n",
+    "display(Markdown(Path('reports/final_report.md').read_text(encoding='utf-8')))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3 - generated metrics table\n",
+    "\n",
+    "The table below holds the numbers cited qualitatively in the report above, generated from the summary."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "from IPython.display import Markdown, display\n",
+    "\n",
+    "display(Markdown(Path('reports/phase4_metrics.md').read_text(encoding='utf-8')))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 4 - summary JSON (artifact availability + headline)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from src import config\n",
+    "\n",
+    "summary = json.loads((config.EVALUATION / 'phase4_summary.json').read_text(encoding='utf-8'))\n",
+    "print('artifact availability:')\n",
+    "for name, part in summary.items():\n",
+    "    print(f\"  {name:<10} {'OK' if part.get('available') else 'MISSING'}\")\n",
+    "\n",
+    "f = summary.get('funsd', {})\n",
+    "if f.get('available'):\n",
+    "    h = f['headline']\n",
+    "    print()\n",
+    "    print(f\"FUNSD headline ({f['primary']}): \"\n",
+    "          f\"P {h['precision']:.3f} / R {h['recall']:.3f} / F1 {h['f1']:.3f}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/reports/final_report.md b/reports/final_report.md
new file mode 100644
index 0000000..03a9e3f
--- /dev/null
+++ b/reports/final_report.md
@@ -0,0 +1,106 @@
+# FinDocStructRAG — Final Report
+
+A layout-aware pipeline for extracting structured tables from financial-report PDFs and
+answering questions over them, plus a standalone form relation-linking baseline. This report
+is the Phase 4 capstone: it states what was built, how it was evaluated, and what the results
+mean. **All metric numbers are generated** by `scripts/build_phase4_summary.py` into
+`reports/phase4_metrics.md` and are never hand-copied into this prose;
+`notebooks/07_final_report.ipynb` renders that generated table inline beneath this report.
+
+## 1. What was built
+
+- **Phase 1A — table topology.** Table Transformer (TATR) structure recognition is turned into
+  a canonical row/column grid: spanning-cell mapping, grid validation, and occupancy-aware HTML
+  parsing.
+- **Phase 1B — content.** OCR (PaddleOCR) fills cell text; financial numbers are normalized.
+  Ground-truth-filled and OCR-filled tables are kept **strictly separate** (project rule P4):
+  GT-filled is a QA-validation reference only, never reported as an extraction output.
+- **Phase 1C — table-only RAG.** One chunk per table; retrieval is BM25 + dense BGE cosine + RRF
+  with **no LLM in the retrieval path** (P5). Answer generation is a single, swappable provider
+  behind `src/llm_client.py`; the GT-filled and OCR-filled corpora are scored separately.
+- **Phase 2 — layout.** A DocLayNet page-level detector finds table regions and crops them, which
+  then feed the Phase 1A/1B pipeline (page -> crop -> structure -> content).
+- **Phase 3 — FUNSD relations.** An annotation-only, deterministic question->answer linking
+  baseline over ground-truth entities (per-answer argmax + distance gate), scored P/R/F1 against
+  the GT links. No image pixels, no GPU.
+
+Throughout, pretrained models are used for **inference and evaluation only** — nothing is
+fine-tuned. The contribution is the layout-aware extraction + retrieval pipeline and an honest
+measurement of it.
+
+## 2. How it was evaluated
+
+- **Subset evaluation, stated as such.** Metrics are computed on fixed, seeded subsets, not a
+  whole-dataset benchmark. The goal is honest, reproducible numbers, not a leaderboard claim.
+- **Deterministic / custom metrics.** Topology, content, retrieval, QA, layout, and relation
+  metrics are all computed in-repo with explicit definitions (no GriTS / Ragas / DeepEval — those
+  are future work, see §5).
+- **GT vs OCR scored separately (P4).** The downstream answer-quality gap between a perfect
+  (GT-filled) table and the real (OCR-filled) extraction is measured directly, not averaged away.
+- **Honest held-out split for relations.** FUNSD heuristic parameters are set on `train_149`
+  only; the headline is the held-out `test_50` (no tuning on the reported set).
+- **Generated, not transcribed.** `scripts/build_phase4_summary.py` aggregates the per-phase
+  artifacts into `outputs/evaluation/phase4_summary.json` and the committed
+  `reports/phase4_metrics.md`. A no-drift check keeps the committed table byte-identical on
+  rebuild, so the reported numbers cannot silently diverge from the artifacts.
+
+## 3. Results
+
+The full, generated metric tables live in **`reports/phase4_metrics.md`** (rendered inline by
+`notebooks/07_final_report.ipynb`). Read qualitatively, the results show:
+
+- **Table structure is strong; content is the harder half.** Column topology and cell-occupancy
+  are recovered at high accuracy; OCR cell-text exactness is the lower number, as expected for a
+  recognition step over scanned financial tables.
+- **Retrieval is near-ceiling on linearized table chunks**, with sparse (BM25) competitive with
+  or ahead of dense on this corpus, and RRF robust across corpora.
+- **OCR introduces a measurable but bounded answer-quality cost.** On the QA set, the GT-filled
+  corpus answers more exactly than the OCR-filled corpus — the gap is the price of the recognition
+  step, and surfacing it is the point of the P4 separation.
+- **Layout crops are accurate and hand off cleanly to TATR**, with a low false-positive rate on
+  table-free pages and almost all crops passing the structure-validity smoke test.
+- **Relation linking is high-precision; recall is the design ceiling.** The single-link,
+  geometry-only V1 recovers most question->answer pairs precisely but cannot, by construction,
+  cover multi-answer or non-geometric links.
+
+## 4. What this is not (limitations)
+
+- **Subset, not full-corpus.** Numbers describe seeded subsets; they are not whole-dataset
+  benchmarks.
+- **GT-filled is a reference, not a result.** Per P4, GT-filled tables exist to validate the QA
+  pipeline; the real extraction is OCR-filled, and the two are never conflated.
+- **FUNSD V1 is annotation-only over GT entities.** It does not detect entities and does not read
+  pixels; it is a relation baseline, not an end-to-end form parser. Single link per answer.
+- **RAG is table-only.** Full-document (non-table) text is not chunked; charts and figures are not
+  extracted (caption-level handling / future work).
+- **The demo is artifact-backed.** It serves already-produced outputs and does live retrieval +
+  answer generation over the existing chunks; it does not run a live PDF -> pipeline.
+
+## 5. Future work
+
+- **GriTS** for formal table-structure / content evaluation.
+- **Ragas / DeepEval** for RAG faithfulness and answer-quality scoring.
+- **Full-document chunking** beyond table-only RAG.
+- **Chart / figure understanding** (chart-to-table, multimodal figures).
+- **Cross-encoder reranker or learned query routing.**
+- **FUNSD V2:** token classification (entity detection) and threshold-based multi-link matching.
+
+## 6. Reproduce in this order
+
+Data and the GPU-dependent runs happen on Colab; aggregation and the demo are local/CPU. The
+per-phase notebooks (`notebooks/01`-`05`) are the runners for steps 1-6.
+
+1. **Data.** Acquire FinTabNet.c and DocLayNet (notebooks `01` / `04` boot cells); fetch FUNSD
+   with `python scripts/fetch_funsd.py`.
+2. **Phase 1A topology.** `run_phase1a_colab.py` -> `evaluate_tables.py --run-id mvp_rand`.
+3. **Phase 1B content.** `run_phase1b_gt_filled.py` + `run_phase1b_ocr_filled.py` ->
+   `evaluate_content.py --run-id mvp_rand`.
+4. **Phase 1C RAG.** `build_table_chunks.py` -> `build_qa_dataset.py` -> `evaluate_rag.py` ->
+   `evaluate_qa.py`.
+5. **Phase 2 layout.** `run_layout_batch.py` -> `eval_layout_iou.py --require-table-gt` (pos) and
+   `--exclude-table-gt` (neg) -> `smoke_structure.py`.
+6. **Phase 3 relations.** `evaluate_funsd.py`.
+7. **Capstone summary.** `python scripts/build_phase4_summary.py` ->
+   `reports/phase4_metrics.md` + `outputs/evaluation/phase4_summary.json` (this report reads the
+   former).
+8. **Demo.** `python scripts/run_demo.py` (key-optional Gradio; PR-C).