From 5c57bac59345b8066d84562e25daa578e0161c9e Mon Sep 17 00:00:00 2001 From: AD2000X Date: Wed, 3 Jun 2026 15:58:55 +0100 Subject: [PATCH] feat: Phase 4 PR-B final report + report notebook - reports/final_report.md: capstone narrative (methodology, results read from the generated table, limitations, future work, reproduce-in-order); no hand-copied metrics - notebooks/07_final_report.ipynb: Colab runner - boot, run build_phase4_summary, render final_report.md + generated phase4_metrics.md inline Report prose reads from the generated reports/phase4_metrics.md; numbers are never hand-copied. --- notebooks/07_final_report.ipynb | 181 ++++++++++++++++++++++++++++++++ reports/final_report.md | 106 +++++++++++++++++++ 2 files changed, 287 insertions(+) create mode 100644 notebooks/07_final_report.ipynb create mode 100644 reports/final_report.md diff --git a/notebooks/07_final_report.ipynb b/notebooks/07_final_report.ipynb new file mode 100644 index 0000000..d46b53d --- /dev/null +++ b/notebooks/07_final_report.ipynb @@ -0,0 +1,181 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Phase 4 - Final report (Colab runner)\n", + "\n", + "Runner only: mount Drive, pull the Phase 4 branch, regenerate the capstone summary from the staged\n", + "evaluation artifacts, then render the final report and the generated metrics table inline. Logic\n", + "lives in `src/` and `scripts/`, not in this notebook (P1/P2).\n", + "\n", + "The metric numbers shown below are produced by `scripts/build_phase4_summary.py`; the report prose\n", + "in `reports/final_report.md` never hand-copies them." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Boot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 1. 
Mount Drive so config.OUTPUT_ROOT (the staged evaluation artifacts) is available.\n", + "from google.colab import drive\n", + "drive.mount('/content/drive')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 2. Get the code onto the VM and pin the Phase 4 branch.\n", + "# Repin BRANCH to 'main' once the Phase 4 PRs are merged.\n", + "import os\n", + "\n", + "REPO = '/content/FinDocStructRAG'\n", + "BRANCH = 'feature/phase4-report' # PR-B; flip to 'main' after merge\n", + "\n", + "if not os.path.isdir(f'{REPO}/.git'):\n", + " !git clone --quiet https://github.com/AD2000X/FinDocStructRAG.git {REPO}\n", + "\n", + "!cd {REPO} && git fetch origin --quiet\n", + "!cd {REPO} && git checkout {BRANCH} && git pull --ff-only origin {BRANCH}\n", + "!cd {REPO} && echo branch: $(git rev-parse --abbrev-ref HEAD) HEAD: $(git log --oneline -1)\n", + "%cd /content/FinDocStructRAG" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 3. Make src/ importable and sanity-check the Phase 4 paths.\n", + "import importlib\n", + "import sys\n", + "\n", + "sys.path.insert(0, '/content/FinDocStructRAG')\n", + "from src import config\n", + "importlib.reload(config)\n", + "\n", + "print('IN_COLAB :', config.IN_COLAB)\n", + "print('OUTPUT_ROOT :', config.OUTPUT_ROOT)\n", + "print('EVALUATION :', config.EVALUATION)\n", + "print('LAYOUT :', config.LAYOUT_OUTPUT)\n", + "print('reports dir :', config.ROOT / 'reports')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1 - build the capstone summary\n", + "\n", + "Aggregates the per-phase artifacts on Drive into `outputs/evaluation/phase4_summary.json` and the\n", + "committed `reports/phase4_metrics.md`. Re-running is idempotent (no-drift)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!python scripts/build_phase4_summary.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2 - final report" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "from IPython.display import Markdown, display\n", + "\n", + "display(Markdown(Path('reports/final_report.md').read_text(encoding='utf-8')))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3 - generated metrics table\n", + "\n", + "The numbers behind the qualitative claims in the report above, generated from the summary." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "from IPython.display import Markdown, display\n", + "\n", + "display(Markdown(Path('reports/phase4_metrics.md').read_text(encoding='utf-8')))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4 - summary JSON (artifact availability + headline)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from src import config\n", + "\n", + "summary = json.loads((config.EVALUATION / 'phase4_summary.json').read_text(encoding='utf-8'))\n", + "print('artifact availability:')\n", + "for name, part in summary.items():\n", + " print(f\" {name:<10} {'OK' if part.get('available') else 'MISSING'}\")\n", + "\n", + "f = summary.get('funsd', {})\n", + "if f.get('available'):\n", + " h = f['headline']\n", + " print(f\"\\n\"\n", + " f\"FUNSD headline ({f['primary']}): \"\n", + " f\"P {h['precision']:.3f} / R {h['recall']:.3f} / F1 {h['f1']:.3f}\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + 
"language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/reports/final_report.md b/reports/final_report.md new file mode 100644 index 0000000..03a9e3f --- /dev/null +++ b/reports/final_report.md @@ -0,0 +1,106 @@ +# FinDocStructRAG — Final Report + +A layout-aware pipeline for extracting structured tables from financial-report PDFs and +answering questions over them, plus a standalone form relation-linking baseline. This report +is the Phase 4 capstone: it states what was built, how it was evaluated, and what the results +mean. **All metric numbers are generated** by `scripts/build_phase4_summary.py` into +`reports/phase4_metrics.md` and are never hand-copied into this prose; +`notebooks/07_final_report.ipynb` renders that generated table inline beneath this report. + +## 1. What was built + +- **Phase 1A — table topology.** Table Transformer (TATR) structure recognition is turned into + a canonical row/column grid: spanning-cell mapping, grid validation, and occupancy-aware HTML + parsing. +- **Phase 1B — content.** OCR (PaddleOCR) fills cell text; financial numbers are normalized. + Ground-truth-filled and OCR-filled tables are kept **strictly separate** (project rule P4): + GT-filled is a QA-validation reference only, never reported as an extraction output. +- **Phase 1C — table-only RAG.** One chunk per table; retrieval is BM25 + dense BGE cosine + RRF + with **no LLM in the retrieval path** (P5). Answer generation is a single, swappable provider + behind `src/llm_client.py`; the GT-filled and OCR-filled corpora are scored separately. +- **Phase 2 — layout.** A DocLayNet page-level detector finds table regions and crops them, which + then feed the Phase 1A/1B pipeline (page -> crop -> structure -> content). +- **Phase 3 — FUNSD relations.** An annotation-only, deterministic question->answer linking + baseline over ground-truth entities (per-answer argmax + distance gate), scored P/R/F1 against + the GT links. 
No image pixels, no GPU. + +Throughout, pretrained models are used for **inference and evaluation only** — nothing is +fine-tuned. The contribution is the layout-aware extraction + retrieval pipeline and an honest +measurement of it. + +## 2. How it was evaluated + +- **Subset evaluation, stated as such.** Metrics are computed on fixed, seeded subsets, not a + whole-dataset benchmark. The goal is honest, reproducible numbers, not a leaderboard claim. +- **Deterministic / custom metrics.** Topology, content, retrieval, QA, layout, and relation + metrics are all computed in-repo with explicit definitions (no GriTS / Ragas / DeepEval — those + are future work, see §5). +- **GT vs OCR scored separately (P4).** The downstream answer-quality gap between a perfect + (GT-filled) table and the real (OCR-filled) extraction is measured directly, not averaged away. +- **Honest held-out split for relations.** FUNSD heuristic parameters are set on `train_149` + only; the headline is the held-out `test_50` (no tuning on the reported set). +- **Generated, not transcribed.** `scripts/build_phase4_summary.py` aggregates the per-phase + artifacts into `outputs/evaluation/phase4_summary.json` and the committed + `reports/phase4_metrics.md`. A no-drift check keeps the committed table byte-identical on + rebuild, so the reported numbers cannot silently diverge from the artifacts. + +## 3. Results + +The full, generated metric tables live in **`reports/phase4_metrics.md`** (rendered inline by +`notebooks/07_final_report.ipynb`). Read qualitatively, the results show: + +- **Table structure is strong; content is the harder half.** Column topology and cell-occupancy + are recovered at high accuracy; OCR cell-text exactness is the lower number, as expected for a + recognition step over scanned financial tables. +- **Retrieval is near-ceiling on linearized table chunks**, with sparse (BM25) competitive with + or ahead of dense on this corpus, and RRF robust across corpora. 
+- **OCR introduces a measurable but bounded answer-quality cost.** On the QA set, the GT-filled + corpus answers more exactly than the OCR-filled corpus: the gap is the price of the recognition + step, and surfacing it is the point of the P4 separation. +- **Layout crops are accurate and hand off cleanly to TATR**, with a low false-positive rate on + table-free pages and almost all crops passing the structure-validity smoke test. +- **Relation linking is high-precision; recall is capped by design.** The single-link, + geometry-only V1 recovers most question->answer pairs precisely but cannot, by construction, + cover multi-answer or non-geometric links. + +## 4. What this is not (limitations) + +- **Subset, not full-corpus.** Numbers describe seeded subsets; they are not whole-dataset + benchmarks. +- **GT-filled is a reference, not a result.** Per P4, GT-filled tables exist to validate the QA + pipeline; the real extraction is OCR-filled, and the two are never conflated. +- **FUNSD V1 is annotation-only over GT entities.** It does not detect entities and does not read + pixels; it is a relation baseline, not an end-to-end form parser. It emits a single link per answer. +- **RAG is table-only.** Full-document (non-table) text is not chunked; charts and figures are not + extracted (caption-level handling / future work). +- **The demo is artifact-backed.** It serves already-produced outputs and does live retrieval + + answer generation over the existing chunks; it does not run the pipeline live on a new PDF. + +## 5. Future work + +- **GriTS** for formal table-structure / content evaluation. +- **Ragas / DeepEval** for RAG faithfulness and answer-quality scoring. +- **Full-document chunking** beyond table-only RAG. +- **Chart / figure understanding** (chart-to-table, multimodal figures). +- **Cross-encoder reranker or learned query routing.** +- **FUNSD V2:** token classification (entity detection) and threshold-based multi-link matching. + +## 6. 
Reproduce in this order + +Data and the GPU-dependent runs happen on Colab; aggregation and the demo are local/CPU. The +per-phase notebooks (`notebooks/01`-`05`) are the runners for steps 1-6. + +1. **Data.** Acquire FinTabNet.c and DocLayNet (notebooks `01` / `04` boot cells); fetch FUNSD + with `python scripts/fetch_funsd.py`. +2. **Phase 1A topology.** `run_phase1a_colab.py` -> `evaluate_tables.py --run-id mvp_rand`. +3. **Phase 1B content.** `run_phase1b_gt_filled.py` + `run_phase1b_ocr_filled.py` -> + `evaluate_content.py --run-id mvp_rand`. +4. **Phase 1C RAG.** `build_table_chunks.py` -> `build_qa_dataset.py` -> `evaluate_rag.py` -> + `evaluate_qa.py`. +5. **Phase 2 layout.** `run_layout_batch.py` -> `eval_layout_iou.py --require-table-gt` (pos) and + `--exclude-table-gt` (neg) -> `smoke_structure.py`. +6. **Phase 3 relations.** `evaluate_funsd.py`. +7. **Capstone summary.** `python scripts/build_phase4_summary.py` -> + `reports/phase4_metrics.md` + `outputs/evaluation/phase4_summary.json` (this report reads the + former). +8. **Demo.** `python scripts/run_demo.py` (key-optional Gradio; PR-C).
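
The no-drift property claimed for the capstone summary (step 7) can be illustrated with a minimal sketch. It assumes the check is a plain byte-for-byte comparison of the committed file before and after a rebuild; the patch does not show how the repository actually implements it, and the helper names `file_digest` and `is_drift_free` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """SHA-256 of the file's raw bytes (no newline or encoding normalization)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_drift_free(path: Path, rebuild: Callable[[], None]) -> bool:
    """Run `rebuild` and report whether `path` is byte-identical afterwards.

    `rebuild` is a hypothetical callable that regenerates the file in place.
    """
    before = file_digest(path)
    rebuild()
    return file_digest(path) == before
```

In this project the natural wiring would be `path = Path('reports/phase4_metrics.md')` with a `rebuild` that shells out to `python scripts/build_phase4_summary.py`, though that wiring is an assumption. Comparing raw bytes rather than parsed values means even formatting-only changes in the generated table count as drift.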