stage 3: also trigger fallback when chosen subtree is <5% of body full-text #3
Merged
The suspiciously-small fallback trigger added in the previous patch compares
text-excluding-links on both sides. On pages where nearly all text IS link
text (table-layout listings of all-anchor rows), text_len_excluding_links is
near-zero for both the body and the chosen subtree, so the disparity ratio
can't be computed meaningfully and the trigger doesn't fire.
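The failure mode can be reproduced with a minimal sketch. It assumes the two metrics are plain character counts of text nodes, with and without text inside `<a>` elements. `TextLengthCounter`, `measure`, and the sample markup are hypothetical illustrations, not the project's actual code.

```python
from html.parser import HTMLParser


class TextLengthCounter(HTMLParser):
    """Counts text length with and without text inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0    # > 0 while inside one or more <a> elements
        self.full_text_len = 0   # all text, link text included
        self.excl_links_len = 0  # text outside any <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.full_text_len += n
        if self.anchor_depth == 0:
            self.excl_links_len += n


def measure(html):
    """Return (full_text_len, text_len_excluding_links) for an HTML fragment."""
    counter = TextLengthCounter()
    counter.feed(html)
    counter.close()
    return counter.full_text_len, counter.excl_links_len


# Hypothetical listing page: a short intro plus 50 all-anchor table rows.
intro = "<p>Latest items</p>"
rows = "".join(
    f'<tr><td><a href="/item/{i}">Item number {i} with a long title</a></td></tr>'
    for i in range(50)
)
body = f"<body>{intro}<table>{rows}</table></body>"

body_full, body_excl = measure(body)
kept_full, kept_excl = measure(intro)  # the subtree the scored walk picked
```

In this sketch the excl-links length is identical for the body and the intro, so no ratio of the two can flag the intro as too small, while the full-text lengths differ by more than a factor of 100.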
Adds a parallel full-text comparison: trigger fallback when
body_full_text >= 1000
AND kept_full_text * 100 < body_full_text * 5
This is OR'd with the existing excl-links 15% check. The tighter 5% threshold
and the 1000-char minimum body avoid false positives on small marketing pages
whose hero/footer are link-heavy.
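The combined trigger can be sketched as a small predicate. The full-text branch follows the condition stated above. The exact form of the existing excl-links branch is not given here, so the version below (a 15% ratio with a non-zero guard) is an assumption, and `should_trigger_fallback` is a hypothetical name.

```python
def should_trigger_fallback(body_full_text, kept_full_text,
                            body_excl_links, kept_excl_links):
    """Return True when the chosen subtree looks suspiciously small.

    All arguments are character counts (integers).
    """
    # Full-text check from this patch: the body has at least 1000 chars
    # and the kept subtree is strictly under 5% of it.
    full_text_small = (
        body_full_text >= 1000
        and kept_full_text * 100 < body_full_text * 5
    )
    # ASSUMPTION: the existing check is modelled as "kept is under 15% of
    # body, excluding link text"; any minimum-size guard it has is unknown.
    excl_links_small = (
        body_excl_links > 0
        and kept_excl_links * 100 < body_excl_links * 15
    )
    return full_text_small or excl_links_small


# Link-heavy listing: excl-links lengths are equal, full text is not.
listing = should_trigger_fallback(1602, 12, 12, 12)
# Small page under the 1000-char minimum: neither branch fires.
small_page = should_trigger_fallback(900, 10, 12, 12)
```

Comparing `kept * 100 < body * 5` rather than dividing keeps the check in integer arithmetic and avoids dividing by a zero-length body.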
Empirical impact on a 23-URL spot-check:
- One link-heavy listing page: 0.1 KB → 2.8 KB (28×); extraction_quality
0.10 → 0.20. Output is the actual list items, not just the intro
paragraph the scored walk had been locking onto.
- 22 other URLs: extraction output unchanged.
- Golden corpus: 54/54 fixtures still pass.
- 37/37 unit + integration + doctests pass.
The existing suspiciously-small fallback trigger compares text_len_excluding_links on both the body and the chosen subtree. On pages where nearly all text is link text (table-layout listings of all-anchor rows), this metric is near-zero on both sides and the disparity ratio can't distinguish a small intro from the substantive content, so the trigger doesn't fire and Stage 3's bad pick wins.
Fix
Adds a parallel full-text comparison, OR'd with the existing 15%-excl-links check:
body_full_text >= 1000
AND kept_full_text * 100 < body_full_text * 5
Tighter 5% threshold (vs 15% for the excl-links variant) and a higher 1000-char minimum body, to avoid false positives on small marketing pages whose hero/footer are link-heavy.
Empirical impact
23-URL spot-check covering articles, docs, forums, listings, product pages, marketing:
- One link-heavy listing page: 0.1 KB → 2.8 KB (28×); extraction_quality 0.10 → 0.20. Output is the actual list items, not just the short intro paragraph the scored walk had been locking onto.