AD2000X · Jun 3, 2026
diff --git a/.gitignore b/.gitignore
@@ -19,6 +19,7 @@ venv/
 .mcp.json
 tmp/
 memory/
+plans/
 
 # Scratch screenshots (never committed)
 output*.png
diff --git a/DEVLOG.md b/DEVLOG.md
@@ -181,6 +181,56 @@ Decisions outgrow this file, split them into `DECISIONS.md` (or `docs/adr/`).
 
 ---
 
+## 2026-06-03 - Phase 3 FUNSD relation-linking baseline (V1)
+
+### Result - annotation-only spatial heuristic; high precision, recall is the design ceiling
+
+First Phase 3 deliverable: a deterministic FUNSD relation-linking baseline over GT entities,
+CPU-only and annotation-only (the FUNSD JSON carries entity text/bbox/label and the GT
+`linking` pairs, so no image pixels are loaded). Run on the real dataset (149 train + 50 test
+= 199 forms) with `scripts/evaluate_funsd.py` and untuned a-priori params:
+
+| split | scope | precision | recall | f1 | tp / pred / gold | n |
+| --- | --- | --- | --- | --- | --- | --- |
+| **test_50** | **qa_links** | **0.946** | **0.590** | **0.727** | 494 / 522 / 837 | 50 |
+| all_199 | qa_links | 0.925 | 0.535 | 0.678 | 2123 / 2295 / 3966 | 199 |
+| test_50 | all_links | 0.946 | 0.464 | 0.623 | 494 / 522 / 1064 | 50 |
+| all_199 | all_links | 0.925 | 0.401 | 0.560 | 2123 / 2295 / 5293 | 199 |
+| train_149 | qa_links | 0.919 | 0.521 | 0.665 | 1629 / 1773 / 3129 | 149 |
+
+Reading it honestly:
+- **Headline (held-out): `test_50.qa_links.micro_f1` = 0.727**, precision 0.946. The heuristic
+  fires conservatively and is right when it does; the limit is recall.
+- **Recall (0.590) is the design ceiling, not a bug.** Per-answer argmax emits at most one link
+  per answer, and the geometry models only same-row right-side and below relations, so answers
+  with any other question placement, or with multiple gold questions, are under-covered. The
+  rejected alternatives (per-question argmax, global threshold) trade this for precision; richer
+  matching (threshold-based multi-link) is the documented next lever, deliberately out of V1.
+- **No tuning-on-test risk.** Params are a-priori defaults; `train_149` F1 (0.665) is *below*
+  `test_50` (0.727), so test is, if anything, the easier split: the gap is sampling, not fitting.
+- **`all_links` is a coverage diagnostic, not a second predictor.** Same QA predictions scored
+  (as undirected pairs) against every GT link; recall necessarily drops (0.464 test) because the
+  QA-only heuristic cannot cover header->question and other link types. `all_199` carries the
+  "contains the 50 test + 149 tuned forms, not held-out" caveat in the report JSON.
+
+Design (locked in discussion; see `plans/phase3-funsd-relations.md`):
+- **Predictor:** per-answer argmax + distance gate; distances normalized by the form's median
+  entity height; two separate knobs (`max_distance_units` distance gate, `min_score` floor).
+- **GT links:** deduped to undirected frozensets (FUNSD records links bidirectionally), then
+  `qa_gold_links` canonicalizes question+answer to directed `(q,a)`; `all_gold_links` keeps the
+  full undirected set.
+- **Reporting matrix:** primary `test_50.qa_links` (held-out); secondary `all_199.qa_links`,
+  `test_50/all_199.all_links`; `train_149` is the dev/tuning split.
+- **No sklearn in V1** (P/R/F1 is set arithmetic). **No image loading** (optional later for
+  qualitative overlay/debug only, never in the baseline or the gate).
+- **Scope held:** standalone branch; does not touch the RAG pipeline. FUNSD token classification
+  (V2 / seqeval) is future work.
+- **Files:** `src/funsd_extraction.py`, `src/eval_funsd.py`, `scripts/evaluate_funsd.py`,
+  `scripts/fetch_funsd.py`, `tests/test_funsd_relations.py` (17 synthetic tests, the gate),
+  `src/config.py` (FUNSD paths). Full suite 236 passed.
+
+---
+
 ## 2026-06-02 - Phase 2 DocLayNet layout-crop MVP gate
 
 ### Finding - Aryn primary carries forward; fallback is narrow; crop->structure needs band dedup

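Review note on the DEVLOG entry above: the "GT links" and "P/R/F1 is set arithmetic" bullets can be illustrated with a minimal sketch. It is not the code in `src/funsd_extraction.py` or `src/eval_funsd.py`. The function names (`dedupe_links_sketch`, `qa_directed_sketch`, `prf_sketch`) are hypothetical, and the input is assumed to follow the raw FUNSD annotation layout (a top-level `form` list whose entities carry `id`, `label`, and `linking` pairs).

```python
def dedupe_links_sketch(annotation: dict) -> set[frozenset[int]]:
    """Collapse bidirectionally recorded `linking` pairs into undirected links."""
    links: set[frozenset[int]] = set()
    for entity in annotation["form"]:
        for a, b in entity.get("linking", []):
            if a != b:  # ignore degenerate self-links
                links.add(frozenset((a, b)))
    return links


def qa_directed_sketch(annotation: dict, links: set[frozenset[int]]) -> set[tuple[int, int]]:
    """Keep only question+answer links, canonicalized to directed (q, a)."""
    label = {e["id"]: e["label"] for e in annotation["form"]}
    directed: set[tuple[int, int]] = set()
    for link in links:
        a, b = sorted(link)
        if {label.get(a), label.get(b)} == {"question", "answer"}:
            directed.add((a, b) if label[a] == "question" else (b, a))
    return directed


def prf_sketch(tp: int, n_pred: int, n_gold: int) -> tuple[float, float, float]:
    """Micro precision / recall / F1 from pooled counts (pure arithmetic)."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Toy annotation (assumed layout): header 0 -> question 1 -> answer 2,
# with every link recorded on both of its endpoints.
toy = {
    "form": [
        {"id": 0, "label": "header", "linking": [[0, 1]]},
        {"id": 1, "label": "question", "linking": [[0, 1], [1, 2]]},
        {"id": 2, "label": "answer", "linking": [[1, 2]]},
    ]
}
all_gold = dedupe_links_sketch(toy)        # two undirected links
qa_gold = qa_directed_sketch(toy, all_gold)  # {(1, 2)}

# Reproduces the held-out headline row from its tp / pred / gold counts.
print(tuple(round(x, 3) for x in prf_sketch(494, 522, 837)))  # (0.946, 0.59, 0.727)
```

Because the counts are pooled before dividing, the `all_199` row follows from summing the `train_149` and `test_50` counts (1629 + 494 = 2123 true positives, and likewise for pred and gold).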
diff --git a/PLAN.md b/PLAN.md
@@ -542,19 +542,26 @@ Implementation details:
 
 ## 7. Next steps
 
-**Phases 0 through 1C are complete and merged** (v1 = table-only RAG). The active branch is
-**Phase 2 (DocLayNet layout integration)**. The detector is pinned, the pure geometry/layout
-modules are tested, the fixed DocLayNet MVP subset is scored, and the crop->TATR structure
-handoff is validated on sampled crops.
-
-Remaining Phase 2 close-out:
-
-1. Pull the branch on Colab and rerun Step 7d with the tightened empty-grid validator
-   (`scripts/smoke_structure.py --n 286 --seed 42`). Expected: 285 OK / 1 WARN, still under
-   the <=5% WARN gate.
-2. Record the confirmed Step 7d result in `DEVLOG.md`, then open the Phase 2 PR.
-3. After merge, repin Colab notebooks to `main`; do not start Phase 3 until the Phase 2 PR is
-   merged or explicitly paused.
+**Phases 0 through 2 are complete and merged** (v1 = table-only RAG; Phase 2 = DocLayNet
+layout-crop integration, merged to `main` 2026-06-03). The active branch is
+**Phase 3 (FUNSD relation branch)**, `feature/phase3-funsd-relations` off `main`.
+
+Phase 3 V1 is implemented and scored (entirely local, CPU-only, no Colab):
+
+1. Annotation-only deterministic relation baseline: `src/funsd_extraction.py` (parse + dedupe
+   + per-answer-argmax predictor), `src/eval_funsd.py` (set-based P/R/F1),
+   `scripts/evaluate_funsd.py` (+ `scripts/fetch_funsd.py`), `tests/test_funsd_relations.py`
+   (17 synthetic tests). Full suite 236 passed.
+2. Headline (held-out `test_50.qa_links`): P 0.946 / R 0.590 / **F1 0.727**; secondaries in
+   `DEVLOG.md` (2026-06-03) and `outputs/evaluation/phase3_funsd_relations.json`.
+
+Remaining:
+
+1. Open the Phase 3 PR.
+2. Optional (train-only): tune `HeuristicParams` on `train_149` if higher recall is wanted;
+   never on `test_50`. FUNSD token classification (V2 / seqeval) and threshold-based multi-link
+   matching are future work, not V1.
+3. Phase 4 (full demo + evaluation + report) is the next phase.
 
 ---
 

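Review note on the PLAN.md hunk above: the "per-answer-argmax predictor" can be pictured with the following sketch. It is an illustration under stated assumptions, not the implementation in `src/funsd_extraction.py`. The knob names `max_distance_units` and `min_score` come from the DEVLOG. Everything else is hypothetical: the `SketchParams` class, its default values, the `1 / (1 + distance)` score, the exact geometric tests, and the entity layout (`id`, `label`, `box` as `[x0, y0, x1, y1]`).

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class SketchParams:
    max_distance_units: float = 6.0  # hypothetical default: distance gate
    min_score: float = 0.1           # hypothetical default: score floor


def _overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return min(a1, b1) - max(a0, b0)


def predict_links_sketch(entities: list[dict], params: SketchParams = SketchParams()) -> set[tuple[int, int]]:
    """Emit at most one (question_id, answer_id) link per answer."""
    if not entities:
        return set()
    # Normalize distances by the form's median entity height.
    unit = median(e["box"][3] - e["box"][1] for e in entities) or 1.0
    questions = [e for e in entities if e["label"] == "question"]
    links: set[tuple[int, int]] = set()
    for ans in (e for e in entities if e["label"] == "answer"):
        ax0, ay0, ax1, ay1 = ans["box"]
        best = None  # (score, question_id)
        for q in questions:
            qx0, qy0, qx1, qy1 = q["box"]
            if _overlap(ay0, ay1, qy0, qy1) > 0 and ax0 >= qx1:
                gap = ax0 - qx1  # answer on the same row, to the right
            elif _overlap(ax0, ax1, qx0, qx1) > 0 and ay0 >= qy1:
                gap = ay0 - qy1  # answer below the question
            else:
                continue         # placement not modeled
            distance = gap / unit
            if distance > params.max_distance_units:
                continue         # distance gate
            score = 1.0 / (1.0 + distance)
            if best is None or score > best[0]:
                best = (score, q["id"])
        if best is not None and best[0] >= params.min_score:
            links.add((best[1], ans["id"]))
    return links


toy_entities = [
    {"id": 0, "label": "question", "box": [10, 10, 60, 20]},
    {"id": 1, "label": "answer", "box": [70, 10, 120, 20]},    # right of question 0
    {"id": 2, "label": "question", "box": [10, 40, 60, 50]},
    {"id": 3, "label": "answer", "box": [10, 55, 60, 65]},     # below question 2
    {"id": 4, "label": "answer", "box": [300, 300, 350, 310]},  # no modeled question
]
toy_links = predict_links_sketch(toy_entities)
```

In this sketch, answer 3 sits below both questions but links only to question 2, the closer one. This is the at-most-one-link-per-answer behavior that the DEVLOG names as the recall ceiling.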
diff --git a/notebooks/05_phase3_funsd_relations.ipynb b/notebooks/05_phase3_funsd_relations.ipynb
@@ -0,0 +1,254 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Phase 3 - FUNSD relation baseline (Colab runner)\n",
+    "\n",
+    "Runner only: mount Drive, pull the Phase 3 branch, fetch the raw FUNSD annotations, run the synthetic unit gate, then run the real FUNSD relation evaluation.\n",
+    "\n",
+    "Phase 3 is annotation-only and CPU-only. The FUNSD JSON carries entity text, bbox, label, and GT linking pairs, so this notebook does not load image pixels and does not need a GPU. Logic lives in `src/` and `scripts/`, not in this notebook.\n",
+    "\n",
+    "Before running in Colab, make sure `feature/phase3-funsd-relations` has been pushed to GitHub. After Phase 3 merges, set `BRANCH = 'main'` in the boot cell."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Boot"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 1. Mount Drive so config.DATA_ROOT and config.OUTPUT_ROOT persist across Colab sessions.\n",
+    "from google.colab import drive\n",
+    "drive.mount('/content/drive')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 2. Get the code onto the VM and pin the Phase 3 branch.\n",
+    "import os\n",
+    "\n",
+    "REPO = '/content/FinDocStructRAG'\n",
+    "BRANCH = 'feature/phase3-funsd-relations'  # change to 'main' after Phase 3 merges\n",
+    "\n",
+    "if not os.path.isdir(f'{REPO}/.git'):\n",
+    "    !git clone --quiet https://github.com/AD2000X/FinDocStructRAG.git {REPO}\n",
+    "\n",
+    "!cd {REPO} && git fetch origin --quiet\n",
+    "!cd {REPO} && git checkout {BRANCH} && git pull --ff-only origin {BRANCH}\n",
+    "!cd {REPO} && echo branch: $(git rev-parse --abbrev-ref HEAD) HEAD: $(git log --oneline -1)\n",
+    "%cd /content/FinDocStructRAG"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 3. Make src/ importable and sanity-check the Phase 3 paths.\n",
+    "import importlib\n",
+    "import sys\n",
+    "\n",
+    "sys.path.insert(0, '/content/FinDocStructRAG')\n",
+    "from src import config\n",
+    "importlib.reload(config)\n",
+    "\n",
+    "print('IN_COLAB    :', config.IN_COLAB)\n",
+    "print('DATA_ROOT   :', config.DATA_ROOT)\n",
+    "print('OUTPUT_ROOT :', config.OUTPUT_ROOT)\n",
+    "print('FUNSD_ROOT  :', config.FUNSD_ROOT)\n",
+    "print('FUNSD_TRAIN :', config.FUNSD_TRAIN)\n",
+    "print('FUNSD_TEST  :', config.FUNSD_TEST)\n",
+    "print('EVALUATION  :', config.EVALUATION)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1 - fetch or reuse FUNSD annotations"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Downloads the official FUNSD zip only if it is not already present on Drive.\n",
+    "# It extracts to data/raw/funsd/dataset/...; tests never depend on this data.\n",
+    "!python scripts/fetch_funsd.py"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Dataset count sanity check. Expected: 149 train + 50 test annotations.\n",
+    "import importlib\n",
+    "from src import config\n",
+    "importlib.reload(config)\n",
+    "\n",
+    "n_train = len(list(config.FUNSD_TRAIN.glob('*.json')))\n",
+    "n_test = len(list(config.FUNSD_TEST.glob('*.json')))\n",
+    "print('train annotations:', n_train)\n",
+    "print('test annotations :', n_test)\n",
+    "assert n_train == 149 and n_test == 50, 'Unexpected FUNSD annotation counts'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2 - unit gate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Small dependency for the synthetic acceptance tests. The Phase 3 runtime itself is stdlib-only.\n",
+    "!python -m pip install -q pytest\n",
+    "!python -m pytest tests/test_funsd_relations.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3 - run Phase 3 evaluation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python scripts/evaluate_funsd.py"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the JSON report and display the split x scope matrix.\n",
+    "import json\n",
+    "from pathlib import Path\n",
+    "\n",
+    "from src import config\n",
+    "\n",
+    "report_path = config.EVALUATION / 'phase3_funsd_relations.json'\n",
+    "report = json.loads(report_path.read_text(encoding='utf-8'))\n",
+    "print('report:', report_path)\n",
+    "print('primary:', report['primary'])\n",
+    "print('note   :', report['note'])\n",
+    "\n",
+    "rows = []\n",
+    "for split, scopes in report['results'].items():\n",
+    "    for scope, m in scopes.items():\n",
+    "        rows.append({\n",
+    "            'split': split,\n",
+    "            'scope': scope,\n",
+    "            'precision': round(m['precision'], 3),\n",
+    "            'recall': round(m['recall'], 3),\n",
+    "            'f1': round(m['f1'], 3),\n",
+    "            'tp': m['tp'],\n",
+    "            'pred': m['n_pred'],\n",
+    "            'gold': m['n_gold'],\n",
+    "            'forms': m['num_forms'],\n",
+    "        })\n",
+    "\n",
+    "try:\n",
+    "    import pandas as pd\n",
+    "    display(pd.DataFrame(rows))\n",
+    "except Exception:\n",
+    "    for row in rows:\n",
+    "        print(row)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 4 - optional error peek"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Show the ten lowest-recall held-out forms (forms with no gold QA links score recall 0.0 and sort first). Read-only; writes no artifacts.\n",
+    "from src.funsd_extraction import load_funsd_split, predict_qa_links, qa_gold_links\n",
+    "from src import config\n",
+    "\n",
+    "test_forms = load_funsd_split(config.FUNSD_TEST)\n",
+    "miss_rows = []\n",
+    "for form in test_forms:\n",
+    "    pred = predict_qa_links(form)\n",
+    "    gold = qa_gold_links(form)\n",
+    "    tp = len(pred & gold)\n",
+    "    recall = tp / len(gold) if gold else 0.0\n",
+    "    precision = tp / len(pred) if pred else 0.0\n",
+    "    miss_rows.append({\n",
+    "        'form_id': form['form_id'],\n",
+    "        'precision': round(precision, 3),\n",
+    "        'recall': round(recall, 3),\n",
+    "        'tp': tp,\n",
+    "        'pred': len(pred),\n",
+    "        'gold': len(gold),\n",
+    "    })\n",
+    "\n",
+    "miss_rows = sorted(miss_rows, key=lambda r: (r['recall'], r['precision'], r['form_id']))[:10]\n",
+    "try:\n",
+    "    import pandas as pd\n",
+    "    display(pd.DataFrame(miss_rows))\n",
+    "except Exception:\n",
+    "    for row in miss_rows:\n",
+    "        print(row)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
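
Review note on the Step 4 cell: a form with an empty gold set gets recall 0.0 there, so it ranks among the "worst" forms even though nothing was missed. One possible variant, sketched below, reports such forms separately instead of ranking them. `lowest_recall_forms` and its input shape (a mapping from form id to a `(pred, gold)` pair of link sets) are hypothetical and not part of the repository.

```python
def lowest_recall_forms(per_form: dict, k: int = 10) -> tuple[list[dict], list[str]]:
    """Rank forms by recall, setting aside forms with no gold links."""
    ranked: list[dict] = []
    no_gold: list[str] = []
    for form_id, (pred, gold) in per_form.items():
        if not gold:
            no_gold.append(form_id)  # recall is undefined here, so do not rank it
            continue
        tp = len(pred & gold)
        ranked.append({
            "form_id": form_id,
            "precision": tp / len(pred) if pred else 0.0,
            "recall": tp / len(gold),
            "tp": tp,
            "pred": len(pred),
            "gold": len(gold),
        })
    # Same ordering as the notebook cell: recall, then precision, then id.
    ranked.sort(key=lambda r: (r["recall"], r["precision"], r["form_id"]))
    return ranked[:k], sorted(no_gold)


toy_per_form = {
    "a": ({(1, 2)}, {(1, 2), (3, 4)}),  # one of two gold links found
    "b": (set(), set()),                # no gold links at all
    "c": ({(5, 6)}, {(5, 6)}),          # perfect
}
worst, skipped = lowest_recall_forms(toy_per_form)
```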