1 change: 1 addition & 0 deletions .gitignore
@@ -19,6 +19,7 @@ venv/
.mcp.json
tmp/
memory/
plans/

# Scratch screenshots (never committed)
output*.png
50 changes: 50 additions & 0 deletions DEVLOG.md
@@ -181,6 +181,56 @@ Decisions outgrow this file, split them into `DECISIONS.md` (or `docs/adr/`).

---

## 2026-06-03 - Phase 3 FUNSD relation-linking baseline (V1)

### Result - annotation-only spatial heuristic; high precision, recall is the design ceiling

First Phase 3 deliverable: a deterministic FUNSD relation-linking baseline over GT entities,
CPU-only and annotation-only (the FUNSD JSON carries entity text/bbox/label and the GT
`linking` pairs, so no image pixels are loaded). Run on the real dataset (149 train + 50 test
= 199 forms) via `scripts/evaluate_funsd.py` with untuned, a-priori params:

| split | scope | precision | recall | f1 | tp / pred / gold | forms |
| --- | --- | --- | --- | --- | --- | --- |
| **test_50** | **qa_links** | **0.946** | **0.590** | **0.727** | 494 / 522 / 837 | 50 |
| all_199 | qa_links | 0.925 | 0.535 | 0.678 | 2123 / 2295 / 3966 | 199 |
| test_50 | all_links | 0.946 | 0.464 | 0.623 | 494 / 522 / 1064 | 50 |
| all_199 | all_links | 0.925 | 0.401 | 0.560 | 2123 / 2295 / 5293 | 199 |
| train_149 | qa_links | 0.919 | 0.521 | 0.665 | 1629 / 1773 / 3129 | 149 |
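
The ratios in the table follow directly from the `tp / pred / gold` counts. As a minimal sketch (the helper name `prf1` is illustrative and is not taken from `src/eval_funsd.py`), micro precision, recall, and F1 can be recomputed from pooled counts like this:

```python
def prf1(tp: int, n_pred: int, n_gold: int) -> tuple[float, float, float]:
    """Micro precision / recall / F1 from pooled counts (plain arithmetic, no sklearn)."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = n_pred + n_gold
    # 2*tp / (pred + gold) is algebraically the harmonic mean of precision and recall.
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# (tp, pred, gold) copied from the table above.
COUNTS = {
    ("test_50", "qa_links"): (494, 522, 837),
    ("all_199", "qa_links"): (2123, 2295, 3966),
    ("test_50", "all_links"): (494, 522, 1064),
    ("all_199", "all_links"): (2123, 2295, 5293),
    ("train_149", "qa_links"): (1629, 1773, 3129),
}

for (split, scope), counts in COUNTS.items():
    p, r, f = prf1(*counts)
    print(f"{split:10s} {scope:10s} P={p:.3f} R={r:.3f} F1={f:.3f}")
```

The same counts also show that the `train_149` and `test_50` rows sum to the `all_199` row for `qa_links`, which is a quick consistency check on the report.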

Reading it honestly:
- **Headline (held-out): `test_50.qa_links.micro_f1` = 0.727**, precision 0.946. The heuristic
fires conservatively and is right when it does; the limit is recall.
- **Recall (0.590) is the design ceiling, not a bug.** Per-answer argmax emits at most one link
per answer, and the geometry only models same-row right-side and below relations - so answers
whose question sits left/above, or that have multiple gold questions, are under-covered. The
rejected alternatives (per-question argmax, global threshold) trade this for precision; richer
matching (threshold-based multi-link) is the documented next lever, deliberately out of V1.
- **No tuning-on-test risk.** Params are a-priori defaults; `train_149` F1 (0.665) is *below*
`test_50` (0.727), so test is if anything the easier split - the gap is sampling, not fitting.
- **`all_links` is a coverage diagnostic, not a second predictor.** Same QA predictions scored
(as undirected pairs) against every GT link; recall necessarily drops (0.464 test) because the
QA-only heuristic cannot cover header->question and other link types. `all_199` carries the
"contains the 50 test + 149 tuned forms, not held-out" caveat in the report JSON.
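
To make the recall ceiling concrete, here is a rough sketch of per-answer argmax with a distance gate. It is not the code in `src/funsd_extraction.py`: the function name `sketch_predict_links`, the toy entities, and the use of plain center-to-center distance (instead of the same-row and below geometry) are all assumptions made for illustration.

```python
import math
from statistics import median


def center(bbox):
    """Center point of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = bbox
    return (x0 + x1) / 2, (y0 + y1) / 2


def sketch_predict_links(entities, max_distance_units=8.0):
    """Per-answer argmax: each answer keeps only its nearest question, if inside the gate.

    Assumption: center distance stands in for the real geometry, which is not reproduced here.
    """
    if not entities:
        return set()
    # Normalize distances by the form's median entity height.
    unit = median(e["bbox"][3] - e["bbox"][1] for e in entities) or 1.0
    questions = [e for e in entities if e["label"] == "question"]
    links = set()
    for ans in (e for e in entities if e["label"] == "answer"):
        ax, ay = center(ans["bbox"])
        best = None  # (distance in units, question id)
        for q in questions:
            qx, qy = center(q["bbox"])
            dist = math.hypot(ax - qx, ay - qy) / unit
            if dist <= max_distance_units and (best is None or dist < best[0]):
                best = (dist, q["id"])
        if best is not None:
            links.add((best[1], ans["id"]))  # directed (question, answer)
    return links


# Made-up toy form, not FUNSD data.
ENTITIES = [
    {"id": 0, "label": "question", "bbox": (10, 10, 60, 20)},
    {"id": 1, "label": "question", "bbox": (10, 40, 60, 50)},
    {"id": 2, "label": "answer", "bbox": (70, 10, 120, 20)},    # near both questions
    {"id": 3, "label": "answer", "bbox": (70, 400, 120, 410)},  # outside the gate
]
GOLD = {(0, 2), (1, 2)}  # answer 2 has two gold questions

pred = sketch_predict_links(ENTITIES)
recall = len(pred & GOLD) / len(GOLD)
print(pred, recall)
```

In this toy case answer 2 has two gold questions but can receive only one predicted link, so recall on it is capped at 0.5 regardless of how the gate is set, while precision stays at 1.0.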

Design (locked in discussion; see `plans/phase3-funsd-relations.md`):
- **Predictor:** per-answer argmax + distance gate; distances normalized by the form's median
entity height; two separate knobs (`max_distance_units` distance gate, `min_score` floor).
- **GT links:** deduped to undirected frozensets (FUNSD records links bidirectionally), then
`qa_gold_links` canonicalizes question+answer to directed `(q,a)`; `all_gold_links` keeps the
full undirected set.
- **Reporting matrix:** primary `test_50.qa_links` (held-out); secondary `all_199.qa_links`,
`test_50/all_199.all_links`; `train_149` is the dev/tuning split.
- **No sklearn in V1** (P/R/F1 is set arithmetic). **No image loading** (optional later for
qualitative overlay/debug only, never in the baseline or the gate).
- **Scope held:** standalone branch; does not touch the RAG pipeline. FUNSD token classification
(V2 / seqeval) is future work.
- **Files:** `src/funsd_extraction.py`, `src/eval_funsd.py`, `scripts/evaluate_funsd.py`,
`scripts/fetch_funsd.py`, `tests/test_funsd_relations.py` (17 synthetic tests, the gate),
`src/config.py` (FUNSD paths). Full suite 236 passed.
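
The GT-link handling described above can be sketched as follows. The `_sketch` function names and the inline sample form are hypothetical and only mirror the description; the actual implementation lives in `src/funsd_extraction.py`. The sample uses the FUNSD annotation layout, in which each entity under `form` carries `id`, `label`, and `linking`.

```python
def all_gold_links_sketch(form):
    """Dedupe bidirectionally recorded links into undirected frozenset pairs."""
    links = set()
    for entity in form["form"]:
        for a, b in entity["linking"]:
            if a != b:  # ignore degenerate self-links
                links.add(frozenset((a, b)))
    return links


def qa_gold_links_sketch(form):
    """Keep only question+answer pairs, canonicalized to directed (q, a) tuples."""
    label = {entity["id"]: entity["label"] for entity in form["form"]}
    qa = set()
    for pair in all_gold_links_sketch(form):
        a, b = sorted(pair)
        if {label.get(a), label.get(b)} == {"question", "answer"}:
            qa.add((a, b) if label[a] == "question" else (b, a))
    return qa


# Made-up sample in the FUNSD annotation layout: each link appears on both endpoints.
SAMPLE = {
    "form": [
        {"id": 0, "label": "header", "linking": [[0, 1]]},
        {"id": 1, "label": "question", "linking": [[0, 1], [1, 2]]},
        {"id": 2, "label": "answer", "linking": [[1, 2]]},
    ]
}

all_gold = all_gold_links_sketch(SAMPLE)
qa_gold = qa_gold_links_sketch(SAMPLE)
print(all_gold, qa_gold)

# Scoring QA predictions against all_links: compare them as undirected pairs.
pred = {(1, 2)}
tp_all = len({frozenset(p) for p in pred} & all_gold)
print(tp_all, len(all_gold))
```

Here the header->question link counts toward `all_gold` but can never be matched by a QA-only prediction, which is why `all_links` recall sits below `qa_links` recall for the same predictions.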

---

## 2026-06-02 - Phase 2 DocLayNet layout-crop MVP gate

### Finding - Aryn primary carries forward; fallback is narrow; crop->structure needs band dedup
33 changes: 20 additions & 13 deletions PLAN.md
@@ -542,19 +542,26 @@ Implementation details:

## 7. Next steps

**Phases 0 through 1C are complete and merged** (v1 = table-only RAG). The active branch is
**Phase 2 (DocLayNet layout integration)**. The detector is pinned, the pure geometry/layout
modules are tested, the fixed DocLayNet MVP subset is scored, and the crop->TATR structure
handoff is validated on sampled crops.

Remaining Phase 2 close-out:

1. Pull the branch on Colab and rerun Step 7d with the tightened empty-grid validator
(`scripts/smoke_structure.py --n 286 --seed 42`). Expected: 285 OK / 1 WARN, still under
the <=5% WARN gate.
2. Record the confirmed Step 7d result in `DEVLOG.md`, then open the Phase 2 PR.
3. After merge, repin Colab notebooks to `main`; do not start Phase 3 until the Phase 2 PR is
merged or explicitly paused.
**Phases 0 through 2 are complete and merged** (v1 = table-only RAG; Phase 2 = DocLayNet
layout-crop integration, merged to `main` 2026-06-03). The active branch is
**Phase 3 (FUNSD relation branch)**, `feature/phase3-funsd-relations` off `main`.

Phase 3 V1 is implemented and scored (entirely local, CPU-only, no Colab):

1. Annotation-only deterministic relation baseline: `src/funsd_extraction.py` (parse + dedupe
+ per-answer-argmax predictor), `src/eval_funsd.py` (set-based P/R/F1),
`scripts/evaluate_funsd.py` (+ `scripts/fetch_funsd.py`), `tests/test_funsd_relations.py`
(17 synthetic tests). Full suite 236 passed.
2. Headline (held-out `test_50.qa_links`): P 0.946 / R 0.590 / **F1 0.727**; secondaries in
`DEVLOG.md` (2026-06-03) and `outputs/evaluation/phase3_funsd_relations.json`.

Remaining:

1. Open the Phase 3 PR.
2. Optional (train-only): tune `HeuristicParams` on `train_149` if higher recall is wanted;
never on `test_50`. FUNSD token classification (V2 / seqeval) and threshold-based multi-link
matching are future work, not V1.
3. Phase 4 (full demo + evaluation + report) is the next phase.

---

254 changes: 254 additions & 0 deletions notebooks/05_phase3_funsd_relations.ipynb
@@ -0,0 +1,254 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Phase 3 - FUNSD relation baseline (Colab runner)\n",
"\n",
"Runner only: mount Drive, pull the Phase 3 branch, fetch the raw FUNSD annotations, run the synthetic unit gate, then run the real FUNSD relation evaluation.\n",
"\n",
"Phase 3 is annotation-only and CPU-only. The FUNSD JSON carries entity text, bbox, label, and GT linking pairs, so this notebook does not load image pixels and does not need a GPU. Logic lives in `src/` and `scripts/`, not in this notebook.\n",
"\n",
"Before running in Colab, make sure `feature/phase3-funsd-relations` has been pushed to GitHub. After Phase 3 merges, set `BRANCH = 'main'` in the boot cell."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Boot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. Mount Drive so config.DATA_ROOT and config.OUTPUT_ROOT persist across Colab sessions.\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2. Get the code onto the VM and pin the Phase 3 branch.\n",
"import os\n",
"\n",
"REPO = '/content/FinDocStructRAG'\n",
"BRANCH = 'feature/phase3-funsd-relations' # change to 'main' after Phase 3 merges\n",
"\n",
"if not os.path.isdir(f'{REPO}/.git'):\n",
"    !git clone --quiet https://github.com/AD2000X/FinDocStructRAG.git {REPO}\n",
"\n",
"!cd {REPO} && git fetch origin --quiet\n",
"!cd {REPO} && git checkout {BRANCH} && git pull --ff-only origin {BRANCH}\n",
"!cd {REPO} && echo branch: $(git rev-parse --abbrev-ref HEAD) HEAD: $(git log --oneline -1)\n",
"%cd /content/FinDocStructRAG"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 3. Make src/ importable and sanity-check the Phase 3 paths.\n",
"import importlib\n",
"import sys\n",
"\n",
"sys.path.insert(0, '/content/FinDocStructRAG')\n",
"from src import config\n",
"importlib.reload(config)\n",
"\n",
"print('IN_COLAB :', config.IN_COLAB)\n",
"print('DATA_ROOT :', config.DATA_ROOT)\n",
"print('OUTPUT_ROOT :', config.OUTPUT_ROOT)\n",
"print('FUNSD_ROOT :', config.FUNSD_ROOT)\n",
"print('FUNSD_TRAIN :', config.FUNSD_TRAIN)\n",
"print('FUNSD_TEST :', config.FUNSD_TEST)\n",
"print('EVALUATION :', config.EVALUATION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 - fetch or reuse FUNSD annotations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Downloads the official FUNSD zip only if it is not already present on Drive.\n",
"# It extracts to data/raw/funsd/dataset/...; tests never depend on this data.\n",
"!python scripts/fetch_funsd.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Dataset count sanity check. Expected: 149 train + 50 test annotations.\n",
"import importlib\n",
"from src import config\n",
"importlib.reload(config)\n",
"\n",
"n_train = len(list(config.FUNSD_TRAIN.glob('*.json')))\n",
"n_test = len(list(config.FUNSD_TEST.glob('*.json')))\n",
"print('train annotations:', n_train)\n",
"print('test annotations :', n_test)\n",
"assert n_train == 149 and n_test == 50, 'Unexpected FUNSD annotation counts'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 - unit gate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Small dependency for the synthetic acceptance tests. The Phase 3 runtime itself is stdlib-only.\n",
"!python -m pip install -q pytest\n",
"!python -m pytest tests/test_funsd_relations.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3 - run Phase 3 evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python scripts/evaluate_funsd.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the JSON report and display the split x scope matrix.\n",
"import json\n",
"from pathlib import Path\n",
"\n",
"from src import config\n",
"\n",
"report_path = config.EVALUATION / 'phase3_funsd_relations.json'\n",
"report = json.loads(report_path.read_text(encoding='utf-8'))\n",
"print('report:', report_path)\n",
"print('primary:', report['primary'])\n",
"print('note :', report['note'])\n",
"\n",
"rows = []\n",
"for split, scopes in report['results'].items():\n",
" for scope, m in scopes.items():\n",
" rows.append({\n",
" 'split': split,\n",
" 'scope': scope,\n",
" 'precision': round(m['precision'], 3),\n",
" 'recall': round(m['recall'], 3),\n",
" 'f1': round(m['f1'], 3),\n",
" 'tp': m['tp'],\n",
" 'pred': m['n_pred'],\n",
" 'gold': m['n_gold'],\n",
" 'forms': m['num_forms'],\n",
" })\n",
"\n",
"try:\n",
" import pandas as pd\n",
" display(pd.DataFrame(rows))\n",
"except Exception:\n",
" for row in rows:\n",
" print(row)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - optional error peek"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show the lowest-recall held-out forms. This is read-only and does not write artifacts.\n",
"from src.funsd_extraction import load_funsd_split, predict_qa_links, qa_gold_links\n",
"from src import config\n",
"\n",
"test_forms = load_funsd_split(config.FUNSD_TEST)\n",
"miss_rows = []\n",
"for form in test_forms:\n",
"    pred = predict_qa_links(form)\n",
"    gold = qa_gold_links(form)\n",
"    tp = len(pred & gold)\n",
"    recall = tp / len(gold) if gold else 0.0\n",
"    precision = tp / len(pred) if pred else 0.0\n",
"    miss_rows.append({\n",
"        'form_id': form['form_id'],\n",
"        'precision': round(precision, 3),\n",
"        'recall': round(recall, 3),\n",
"        'tp': tp,\n",
"        'pred': len(pred),\n",
"        'gold': len(gold),\n",
"    })\n",
"\n",
"miss_rows = sorted(miss_rows, key=lambda r: (r['recall'], r['precision'], r['form_id']))[:10]\n",
"try:\n",
"    import pandas as pd\n",
"    display(pd.DataFrame(miss_rows))\n",
"except Exception:\n",
"    for row in miss_rows:\n",
"        print(row)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}