Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 1 addition & 2 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,5 @@ venv/
tmp/
memory/

# Local-only engineering journal + scratch screenshots (never committed)
DEVLOG.md
# Scratch screenshots (never committed)
output*.png
10 changes: 6 additions & 4 deletions DESIGN_SPEC.md
Original file line number Diff line number Diff line change
Expand Up @@ -527,7 +527,7 @@ Rules:

| Module | Approach |
|--------|----------|
| Layout AP/IoU | pycocotools |
| Layout IoU | custom IoU diagnostics + greedy table-level matching; pycocotools optional |
| Table MVP | custom proxy (spanning via `map_spanning_bbox_to_grid`) |
| Table Final | GriTS |
| FUNSD relation V1 | custom + sklearn |
Expand Down Expand Up @@ -572,7 +572,8 @@ PaddleOCR -> OCRWord -> `assign_words_to_cells()` -> `numeric_utils` (V9 `looks_
Table-only RAG QA: build table chunks -> BM25 + dense (exact BGE cosine, `BAAI/bge-small-en-v1.5`; FAISS optional if the corpus grows) + RRF -> Markdown vs linearized serialization experiment (resolved: linearized carries forward) -> source-grounded QA via `src/llm_client.py` -> report GT-filled QA vs OCR-filled QA separately. This is the v1 release (see PLAN.md §2).

### Phase 2 (~1-1.5 weeks)
bbox_utils + label mapping -> DocLayNet -> sequential + fallback.
bbox_utils + label mapping -> HF DocLayNet detector -> table crops -> existing TATR
structure handoff. Evaluate on fixed DocLayNet subsets with custom IoU/table-level matching.

### Phase 3 (~1 week)
FUNSD spatial heuristic + boosts -> relation P/R/F1.
Expand Down Expand Up @@ -679,8 +680,9 @@ test_numeric_utils.py adds/fixes:

> Scope note: these bullets describe the **full-project target** (Phases 1-4). As of now,
> Phases 0-1C are delivered and merged (FinTabNet.c table topology, OCR content extraction,
> and table-only RAG QA = v1). DocLayNet layout (Phase 2) and FUNSD relations (Phase 3) are
> planned / in progress, not yet built — describe current state with the delivered scope.
> and table-only RAG QA = v1). DocLayNet layout (Phase 2) is active on the Phase 2 branch
> with the layout-crop MVP gate implemented/evaluated; FUNSD relations (Phase 3) are still
> planned. Describe current state with the delivered scope.

**Short version:**

Expand Down
785 changes: 785 additions & 0 deletions DEVLOG.md

Large diffs are not rendered by default.

98 changes: 51 additions & 47 deletions PLAN.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ strong reason. Items marked "experiment" are decided by Phase results; items mar
| **Embedding model (dense retrieval)** | **`BAAI/bge-small-en-v1.5`** | This plan (V9 unspecified) |
| **Answer-generation LLM** | **OpenRouter (OpenAI-compatible gateway; default `openai/gpt-4o-mini`; single provider, swappable via `src/llm_client.py`)** | This plan |
| Numeric normalization | `src/numeric_utils.py` (V9-fixed `looks_numeric`) | V9 §5.13 |
| Eval tooling | pycocotools / custom retrieval metrics (hit/recall/MRR; ranx optional) / seqeval / sklearn / GriTS (final) / custom proxy | V9 §8 |
| Eval tooling | custom layout IoU + table-level matching (pycocotools optional) / custom retrieval metrics (hit/recall/MRR; ranx optional) / seqeval / sklearn / GriTS (final) / custom proxy | V9 section 8 |
| Demo | Gradio | V9 |
| Environment / workflow | VS Code Colab extension + git-as-truth + `.py` core / `.ipynb` runner | This plan §1 |

Expand Down Expand Up @@ -312,25 +312,24 @@ OCR pipeline. The carry-forward is the table crop; nothing downstream of the cro
**Non-goal (bounded, same discipline as 1C)**: Phase 2 is layout detection + table-crop
handoff only. It does **not** build full-document RAG over text/figure/caption regions (a
mixed-corpus retrieval task that waits until the layout output is trusted). Detecting
non-table classes is for AP/IoU scoring and the future corpus, not for 1C-style answering yet.
non-table classes is for layout diagnostics and the future corpus, not for 1C-style answering yet.

#### Decision to lock first: the page-level layout detector

DocLayNet is the **dataset and the AP/IoU target**, not a model. The pipeline needs a detector
*trained on* DocLayNet classes whose output feeds §9's `id2label -> LAYOUT_LABEL_MAP ->
DocLayNet is the **dataset and IoU target**, not a model. The pipeline needs a detector
*trained on* DocLayNet classes whose output feeds DESIGN_SPEC section 9's `id2label -> LAYOUT_LABEL_MAP ->
normalize_label`. On a T4 this is a **pretrained detector at inference**, not training from
scratch. Candidates:

| Option | API | Notes |
|--------|-----|-------|
| **HF DETR-family fine-tuned on DocLayNet** (deformable-DETR / RT-DETR / DiT variant) | `transformers.AutoModelForObjectDetection` | **Recommended.** Exposes `config.id2label` exactly as §9 assumes; no new heavy dep (transformers already in the stack); T4-friendly. One weight download. |
| **HF DETR-family fine-tuned on DocLayNet** (deformable-DETR / RT-DETR / DiT variant) | `transformers.AutoModelForObjectDetection` | **Recommended.** Exposes `config.id2label` exactly as DESIGN_SPEC section 9 assumes; no new heavy dep (transformers already in the stack); T4-friendly. One weight download. |
| YOLO-DocLayNet (ultralytics) | ultralytics `YOLO` | Fastest on T4, but a non-transformers API (no `config.id2label`), adds the ultralytics dep, needs a label-map shim. |
| PP-DocLayout (PaddleX) | Paddle | Strong, but pulls the Paddle stack (a Phase 1B install pain point) and a non-HF API. |
| Detectron2 / LayoutParser | Detectron2 | **Avoid**: brittle install on Colab T4. |

Recommendation: the **HF DETR-on-DocLayNet** path, to keep §9 literal and add no heavy
dependency. The exact model id is pinned into `config.LAYOUT_MODEL` only after it is confirmed
to load and run on a T4 (verify before locking, like the embedding model was).
Recommendation: the **HF DETR-on-DocLayNet** path, to keep DESIGN_SPEC section 9 literal and add no heavy
dependency. This is now resolved and pinned after the T4 smoke.

**Resolved (T4 smoke, `scripts/smoke_layout_detector.py`):** pinned
`config.LAYOUT_MODEL = "Aryn/deformable-detr-DocLayNet"` - loads in ~3 s, ~1 GB VRAM,
Expand All @@ -340,69 +339,75 @@ with transformers 4.36.2 and uses a timm resnet50 backbone, which transformers>=
loader leaves unloaded (degenerate detections), so `requirements-colab.txt` pins
`transformers==4.49.0` and the smoke warns on v5.

#### Architecture (DESIGN_SPEC §4.1, sequential-first + fallback)
#### Architecture (DESIGN_SPEC section 4.1, sequential-first + fallback)

```
page image -> layout detector -> regions {Text, Title, Table, Figure, ...}
-> normalize_label -> select Table regions
-> bbox_utils crop (with small padding)
-> [low-confidence / empty: fallback to microsoft/table-transformer-detection]
-> [low-confidence primary table: fallback to microsoft/table-transformer-detection]
-> IoU dedup -> each table crop -> EXISTING Phase 1A TATR + Phase 1B OCR (unchanged)
```

#### Phase 2-dev on Colab T4 (pure logic first, synthetic unit tests P3)
#### Phase 2-dev on Colab T4 (pure logic first, synthetic unit tests - P3)

Use the Colab T4 runtime as the Phase 2 development runtime if it is already connected; this
keeps the dependency stack identical to the detector smoke and avoids Windows/Colab drift.
The first implementation steps are still **pure logic**: unit tests use fake boxes / fake
detector outputs and do not call the DocLayNet model. In other words, "pure logic first"
describes the dependency boundary, not the physical machine.

1. `src/bbox_utils.py` (§4.3, the first step): coordinate/format conversions (xyxy<->xywh,
COCO<->pixel, DocLayNet normalized 0-1 <-> page pixels at a pinned render DPI),
crop-with-padding, IoU, IoU-dedup / NMS-merge. All pure, fake-box unit tests, no GPU.
2. `src/layout_parsing.py`: `LAYOUT_LABEL_MAP` + `normalize_label()` (§9), a region dataclass,
1. `src/bbox_utils.py` (DESIGN_SPEC section 4.3, first step): xyxy<->xywh conversion,
clamp, crop-with-padding, IoU, and deterministic IoU-dedup. All pure, fake-box unit
tests, no GPU.
2. `src/layout_parsing.py`: `LAYOUT_LABEL_MAP` + `normalize_label()` (DESIGN_SPEC section 9), a region dataclass,
and the **sequential + fallback** selection/dedup as a pure function over detector outputs
with the **detector injected** (the `llm_client.complete` pattern), so the whole
page->crops path is testable with fake detections and no GPU.
3. `src/table_detection.py`: the fallback table-detector adapter
(`microsoft/table-transformer-detection`, already in config) behind the same region contract.
4. Tests: `test_label_mapping.py` (§12/§13), bbox conversions / IoU / dedup, the
4. Tests: `test_layout_parsing.py`, `test_bbox_utils.py`, and `test_tatr_postprocess.py`: bbox conversions / IoU / dedup, the
fallback-trigger rule (low confidence -> fallback fires; high confidence -> it does not),
crop-coordinate correctness.

#### Phase 2-colab inference / eval (same T4 runtime, real detection)

1. DocLayNet access via HF `datasets`: `unique("doc_category")` first, then filter (§2); fixed
random subsets per §18.6 (debug seed 7 / mvp seed 42).
1. DocLayNet access via HF `datasets`: fixed random subsets per DESIGN_SPEC section 18.6
(debug seed 7 / mvp seed 42).
2. Run the layout detector on a DocLayNet page subset; persist predicted regions + table crops
to a **Phase-2-owned** artifact stream (`outputs/layout/`), kept separate from the FinTabNet
table outputs (P4 separation discipline).
3. Feed detected table crops through the **existing** Phase 1A TATR topology pipeline to confirm
3. Score table crops against DocLayNet GT with custom IoU diagnostics and greedy table-level
matching (`scripts/eval_layout_iou.py`), not whole-dataset claims.
4. Feed detected table crops through the **existing** Phase 1A TATR topology pipeline to confirm
the handoff end-to-end (full page -> detected crop -> grid), the FinTabNet->page generalization.
4. Layout AP/IoU via `pycocotools` against DocLayNet GT regions.

#### New config / artifacts

- `config.LAYOUT_MODEL = "Aryn/deformable-detr-DocLayNet"`, DocLayNet paths, a pinned render DPI.
- `outputs/layout/` (predicted regions + detected crops) and `outputs/manifests/phase2_layout_<run>.csv`
per the §18 manifest convention; detected crops never mixed into the FinTabNet streams.
- `outputs/layout/` (`regions/`, `crops/`, `manifest.csv`, diagnostic CSVs, and
`smoke_structure.csv`); detected crops never mixed into the FinTabNet streams.

**Acceptance criteria**
- `bbox_utils`, `normalize_label`, and the sequential+fallback selection unit tests pass in the
Colab T4 runtime without detector/model calls (and may also pass locally) (P3).
- Layout AP/IoU reported on a fixed DocLayNet subset (`pycocotools`), with `processed / skipped /
failed` + the subset descriptor (§18.6) — never phrased as a whole-dataset number.
- End-to-end: a table crop detected from a full page runs through the existing Phase 1A TATR
topology pipeline and produces a grid.
- `bbox_utils`, `normalize_label`, sequential/fallback selection, and row/col band dedup unit
tests pass without detector/model calls (P3).
- Layout crop quality reported on fixed DocLayNet subsets with subset descriptor, seed, and n.
Current MVP gate (seed 42, n=200): mean best table-crop IoU 0.900 on GT-table pages;
table-level matched@0.50 recall 0.900 / precision 0.916; matched@0.75 recall 0.880 /
precision 0.895; table-free false-positive crop rate 13/200 = 6.5%.
- End-to-end: detected table crops run through the existing Phase 1A TATR structure pipeline.
With `dedup_row_col_bands` as the default, the n=50 smoke is 50/50 OK. The full n=286
smoke must be rerun after the empty-grid validator fix; expected result is 285 OK / 1 WARN
(`val_000670_table_1`, no row boxes), still below the 5% WARN gate.
- P4 held: DocLayNet GT regions vs detected regions stay separate artifacts; detected crops are
not reported as GT.

**Risks specific to Phase 2**
- Coordinate-space mismatch (DocLayNet normalized 0-1 / COCO xywh / PDF points / rendered-image
pixels) -> centralize every conversion in `bbox_utils`, pin one render DPI.
- Double counting between the layout-table path and the table-transformer-detection fallback ->
IoU dedup with a fixed, unit-tested threshold.
IoU dedup with a fixed, unit-tested threshold (`dedup_iou=0.70`). Fallback is skipped when
primary finds zero tables because TATR creates too many table-free false positives.
- A DocLayNet "Table" crop may be looser than what TATR-fin expects (TATR trained on tight
crops) -> small padding, validate the handoff on the debug subset before scaling.
- Detectron2 / Paddle install brittleness -> the HF transformers detector avoids it.
Expand Down Expand Up @@ -529,33 +534,32 @@ Implementation details:
| 1A | Yes | table topology reconstruction + metrics + 9 screenshots |
| 1B | | OCR content extraction + content metrics |
| 1C | | **table-only RAG end-to-end demo loop** |
| 2 | | DocLayNet layout + table crop integration |
| 2 | | DocLayNet layout + table crop integration (layout-crop MVP scored on branch) |
| 3 | | FUNSD relation P/R/F1 |
| 4 | | Gradio demo + full eval + report |

---

## 7. Next steps

**Phases 0 through 1C are complete and merged** (v1 = table-only RAG). The next track is
**Phase 2 (DocLayNet layout integration)**; do not open a second track in parallel. The
detector smoke is complete and `config.LAYOUT_MODEL` is pinned to
`Aryn/deformable-detr-DocLayNet` (see §3 Phase 2). Keep using the Colab T4 runtime for Phase
2 development if it is already connected, but keep unit tests model-free until the inference
step.
Order:

1. **Phase 2-dev first on T4, pure logic / synthetic unit tests (P3)**: `bbox_utils.py`
(coordinate conversions, IoU, crop-with-padding, dedup), `layout_parsing.py` (`LAYOUT_LABEL_MAP` +
`normalize_label` + the sequential/fallback selection over an injected detector),
`table_detection.py` (the `table-transformer-detection` fallback adapter).
2. **Then real T4 inference / eval**: run detection on a fixed DocLayNet subset, layout AP/IoU via
`pycocotools`, and confirm a detected table crop feeds the existing Phase 1A topology pipeline.
**Phases 0 through 1C are complete and merged** (v1 = table-only RAG). The active branch is
**Phase 2 (DocLayNet layout integration)**. The detector is pinned, the pure geometry/layout
modules are tested, the fixed DocLayNet MVP subset is scored, and the crop->TATR structure
handoff is validated on sampled crops.

Remaining Phase 2 close-out:

1. Pull the branch on Colab and rerun Step 7d with the tightened empty-grid validator
(`scripts/smoke_structure.py --n 286 --seed 42`). Expected: 285 OK / 1 WARN, still under
the <=5% WARN gate.
2. Record the confirmed Step 7d result in `DEVLOG.md`, then open the Phase 2 PR.
3. After merge, repin Colab notebooks to `main`; do not start Phase 3 until the Phase 2 PR is
merged or explicitly paused.

---

### Sources

- [Google Colab is Coming to VS Code Google Developers Blog](https://developers.googleblog.com/google-colab-is-coming-to-vs-code/)
- [How do I access local files from colab kernel? googlecolab/colab-vscode#210](https://github.com/googlecolab/colab-vscode/issues/210)
- [Google Colab VS Code Extension Marketplace](https://marketplace.visualstudio.com/items?itemName=Google.colab)
- [Google Colab is Coming to VS Code - Google Developers Blog](https://developers.googleblog.com/google-colab-is-coming-to-vs-code/)
- [How do I access local files from colab kernel? - googlecolab/colab-vscode#210](https://github.com/googlecolab/colab-vscode/issues/210)
- [Google Colab VS Code Extension - Marketplace](https://marketplace.visualstudio.com/items?itemName=Google.colab)
10 changes: 7 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,10 @@ spanning-cell mapping, grid validation, occupancy-aware HTML parsing); Phase 1B
content extraction (word-to-cell assignment, financial number normalization, content
metrics); and Phase 1C table-only RAG (BM25 + dense BGE cosine + RRF retrieval, one
chunk per table, single-provider grounded answer generation, GT-filled vs OCR-filled
corpora scored separately). Next is Phase 2 (DocLayNet layout integration: page-level
region detection -> table crop -> the existing Phase 1A/1B pipeline). See
[PLAN.md](PLAN.md) for the phase roadmap.
corpora scored separately).

Current branch: Phase 2 (DocLayNet layout integration) is the active follow-up:
page-level region detection -> table crop -> the existing Phase 1A/1B pipeline.
The layout-crop MVP gate is implemented and scored on fixed DocLayNet subsets; the
remaining close-out is the full crop->TATR structure smoke rerun after the tightened
empty-grid validator. See [PLAN.md](PLAN.md) for the phase roadmap.
Loading
Loading