
stage 3: also trigger fallback when chosen subtree is <5% of body full-text #3

Merged
abimaelmartell merged 1 commit into main from fix/stage3-fallback-link-heavy-pages
May 20, 2026

Conversation

@abimaelmartell
Member

The existing suspiciously-small fallback trigger compares text_len_excluding_links on both the body and the chosen subtree. On pages where nearly all text is link text (table-layout listings of all-anchor rows), this metric is near-zero on both sides, so the disparity ratio can't distinguish a small intro from the substantive content. The trigger doesn't fire and Stage 3's bad pick wins.
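
To make the failure mode concrete, here is a rough sketch of how the two metrics diverge on an all-anchor listing. It is not the project's code: the `TextLengths` class, the `measure` helper, and the toy page are all hypothetical, and Python's standard `html.parser` stands in for whatever DOM the extractor actually walks.

```python
from html.parser import HTMLParser


class TextLengths(HTMLParser):
    """Counts text characters, split by whether they sit inside an <a> element.

    Hypothetical stand-in for the extractor's own text-length metrics.
    """

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.full_text = 0
        self.text_excluding_links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.full_text += n
        # Only text outside any anchor counts toward the excl-links metric.
        if self.anchor_depth == 0:
            self.text_excluding_links += n


def measure(html):
    """Return (full_text_len, text_len_excluding_links) for an HTML fragment."""
    parser = TextLengths()
    parser.feed(html)
    parser.close()
    return parser.full_text, parser.text_excluding_links


# Toy page: a short intro paragraph followed by 100 rows that are all link text.
intro = "<p>Browse all entries.</p>"
rows = "".join(
    f'<tr><td><a href="/item/{i}">Listing entry number {i}</a></td></tr>'
    for i in range(100)
)
page = f"<body>{intro}<table>{rows}</table></body>"

body_full, body_excl = measure(page)
kept_full, kept_excl = measure(intro)  # pretend Stage 3 picked only the intro
```

On this toy page the intro accounts for all of the body's non-link text, so an excl-links comparison sees the chosen subtree as covering the whole body, while the full-text lengths show it is only a small fraction.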

Fix

Adds a parallel full-text comparison, OR'd with the existing 15%-excl-links check:

body_full_text  >= 1000
AND  kept_full_text * 100  <  body_full_text * 5

The full-text variant uses a tighter 5% threshold (vs 15% for the excl-links variant) and a higher 1000-char minimum body, to avoid false positives on small marketing pages whose hero/footer are link-heavy.
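
The condition above can be sketched as a small predicate. This is an illustration under assumptions, not the merged implementation: the function and constant names are hypothetical, and the existing excl-links check is passed in as an already-computed boolean because its exact form is not shown here.

```python
# Values taken from the PR description; the names are hypothetical.
FULL_TEXT_MIN_BODY = 1000  # minimum body full-text length, in characters
FULL_TEXT_PERCENT = 5      # chosen subtree must be below this share of the body


def full_text_trigger(body_full_text, kept_full_text):
    """New check: the body is large and the kept subtree is under 5% of it.

    Cross-multiplying keeps the comparison in integers, matching the
    `kept * 100 < body * 5` form given in the description.
    """
    return (
        body_full_text >= FULL_TEXT_MIN_BODY
        and kept_full_text * 100 < body_full_text * FULL_TEXT_PERCENT
    )


def should_fall_back(excl_links_check_fired, body_full_text, kept_full_text):
    """OR the existing excl-links result with the new full-text check."""
    return excl_links_check_fired or full_text_trigger(
        body_full_text, kept_full_text
    )
```

With these definitions, a 2000-char body and a 99-char chosen subtree fires the trigger, a subtree at exactly 5% (100 chars) does not because the comparison is strict, and a body under 1000 chars never fires the full-text check regardless of the ratio.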

Empirical impact

23-URL spot-check covering articles, docs, forums, listings, product pages, marketing:

  • One table-layout listing page: 0.1 KB → 2.8 KB (28×); extraction_quality 0.10 → 0.20. Output is the actual list items, not just the short intro paragraph the scored walk had been locking onto.
  • 22 other URLs: extraction output unchanged.
  • Golden corpus: 54/54 fixtures pass.
  • Unit + integration + doctests: 37/37 pass.

…l-text

The suspiciously-small fallback trigger added in the previous patch compares
text-excluding-links on both sides. On pages where nearly all text IS link
text (table-layout listings of all-anchor rows), text_len_excluding_links is
near-zero for both the body and the chosen subtree, so the disparity ratio
can't be computed meaningfully and the trigger doesn't fire.

Adds a parallel full-text comparison: trigger fallback when

  body_full_text  >= 1000
  AND  kept_full_text * 100  <  body_full_text * 5

OR'd with the existing excl-links 15% check. Tighter 5% threshold + 1000-
char minimum body to avoid false positives on small marketing pages whose
hero/footer are link-heavy.

Empirical impact on a 23-URL spot-check:

  - One link-heavy listing page: 0.1 KB → 2.8 KB (28×); extraction_quality
    0.10 → 0.20. Output is the actual list items, not just the intro
    paragraph the scored walk had been locking onto.
  - 22 other URLs: extraction output unchanged.
  - Golden corpus: 54/54 fixtures still pass.
  - 37/37 unit + integration + doctests pass.
@abimaelmartell abimaelmartell merged commit 418e53d into main May 20, 2026
6 checks passed
