stage 3: also trigger fallback when chosen subtree is <5% of body full-text by abimaelmartell · Pull Request #3 · firecrawl/html-extractor

abimaelmartell · 2026-05-20T21:20:22Z

The existing suspiciously-small fallback trigger compares text_len_excluding_links on both the body and the chosen subtree. On pages where nearly all text is link text (table-layout listings of all-anchor rows), this metric is near-zero on both sides and the disparity ratio can't distinguish a small intro from the substantive content — the trigger doesn't fire and Stage 3's bad pick wins.
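To make the failure mode concrete, here is a small worked example with made-up character counts for an all-anchor listing page. The numbers, the dictionary layout, and the `below_percent` helper are hypothetical. The exact form of the existing 15% check is also assumed, mirroring the integer comparison used elsewhere in this PR.

```python
# Hypothetical character counts for a table-layout listing made of anchor rows.
# The body holds ~3000 chars of text, almost all of it inside links.
body = {"full_text": 3000, "text_excl_links": 50}
# The chosen subtree is a short intro paragraph with no links at all.
kept = {"full_text": 40, "text_excl_links": 40}


def below_percent(kept_len: int, body_len: int, percent: int) -> bool:
    """True when kept_len is strictly less than `percent`% of body_len.

    Integer arithmetic only, so there is no division by zero when body_len == 0.
    """
    return kept_len * 100 < body_len * percent


# Assumed shape of the existing check: kept excl-links text < 15% of body's.
excl_links_fires = below_percent(
    kept["text_excl_links"], body["text_excl_links"], 15
)

# Share of the body's full text that the chosen subtree actually covers.
full_text_share_percent = kept["full_text"] * 100 / body["full_text"]

print(excl_links_fires)                   # the excl-links check stays silent
print(round(full_text_share_percent, 2))  # yet the subtree is a sliver of the page
```

With these counts the excl-links comparison sees 40 of 50 characters (80%), far above 15%, so the trigger stays silent even though the subtree holds barely more than 1% of the page's full text.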

Fix

Adds a parallel full-text comparison, OR'd with the existing 15%-excl-links check:

body_full_text  >= 1000
AND  kept_full_text * 100  <  body_full_text * 5

Tighter 5% threshold (vs 15% for the excl-links variant) and a higher 1000-char minimum body, to avoid false positives on small marketing pages whose hero/footer are link-heavy.
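The condition above can be sketched as a small predicate. This is an illustration in Python, not the project's actual code: the function and constant names are hypothetical, and the existing excl-links check is passed in as a precomputed boolean because its full definition is not shown here.

```python
# Values taken from the PR description; the names are illustrative only.
FULL_TEXT_MIN_BODY_CHARS = 1000  # skip small pages (e.g. link-heavy marketing pages)
FULL_TEXT_MAX_PERCENT = 5        # kept subtree must be below 5% of body full text


def full_text_trigger(body_full_text: int, kept_full_text: int) -> bool:
    """New check: body is large enough AND kept subtree is < 5% of its full text."""
    return (
        body_full_text >= FULL_TEXT_MIN_BODY_CHARS
        and kept_full_text * 100 < body_full_text * FULL_TEXT_MAX_PERCENT
    )


def should_fall_back(
    excl_links_trigger: bool, body_full_text: int, kept_full_text: int
) -> bool:
    """OR the existing 15% excl-links result with the new full-text check."""
    return excl_links_trigger or full_text_trigger(body_full_text, kept_full_text)
```

The comparison is strict, so a subtree holding exactly 5% of the body text does not trigger the fallback, and any body under 1000 characters is exempt from the full-text check regardless of the ratio.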

Empirical impact

23-URL spot-check covering articles, docs, forums, listings, product pages, marketing:

One table-layout listing page: 0.1 KB → 2.8 KB (28×); extraction_quality 0.10 → 0.20. Output is the actual list items, not just the short intro paragraph the scored walk had been locking onto.
22 other URLs: extraction output unchanged.
Golden corpus: 54/54 fixtures pass.
Unit + integration + doctests: 37/37 pass.

…l-text

The suspiciously-small fallback trigger added in the previous patch compares text-excluding-links on both sides. On pages where nearly all text IS link text (table-layout listings of all-anchor rows), text_len_excluding_links is near-zero for both the body and the chosen subtree, so the disparity ratio can't be computed meaningfully and the trigger doesn't fire.

Adds a parallel full-text comparison: trigger fallback when

body_full_text  >= 1000
AND  kept_full_text * 100  <  body_full_text * 5

OR'd with the existing excl-links 15% check. Tighter 5% threshold + 1000-char minimum body to avoid false positives on small marketing pages whose hero/footer are link-heavy.

Empirical impact on a 23-URL spot-check:
- One link-heavy listing page: 0.1 KB → 2.8 KB (28×); extraction_quality 0.10 → 0.20. Output is the actual list items, not just the intro paragraph the scored walk had been locking onto.
- 22 other URLs: extraction output unchanged.
- Golden corpus: 54/54 fixtures still pass.
- 37/37 unit + integration + doctests pass.

abimaelmartell merged commit 418e53d into main May 20, 2026
6 checks passed

abimaelmartell merged 1 commit into main from fix/stage3-fallback-link-heavy-pages
