classifier: stronger URL signal + harmonic class scaling #1
Merged
Three changes to the page-type classifier:

1. URL signal weight bumped 3.0 → 5.0. URL patterns (/pricing, /docs, /products, /article) are deliberate routing decisions by the site author and historically the most reliable single signal. Bumping them keeps URL hits dominant when class-hint noise accumulates.
2. Class signals now use harmonic (sub-linear) scaling, i.e. n matches contribute 1 + 1/2 + … + 1/n: 1 match → +1.0, 2 → +1.5, 5 → +2.28, 20 → +3.6. Previously each match added a flat +1.0, which let a 20-card product grid drown out URL_SERVICE on /pricing pages (Stripe-style). The first match is full-weight evidence; each repeat counts for less.
3. Drop `topic` from URL_FORUM. `/topics/` is widely used as a docs section path (Django: /en/5.0/topics/db/queries/) and mis-classifies docs as forums. `thread`/`discussion`/`question` are the unambiguous forum signals.

Net behavior change on 23-URL real-world spot check:

- MDN Array docs: Listing → Documentation (correct)
- Stripe /pricing: Product → Other (was wrongly Product; now safely uncertain)
- Django docs: Documentation → kept as Documentation (regression averted by the `topic` drop)
- All other 20 URLs: classification unchanged
- Golden corpus: all 54 fixtures still pass

Known limitations (NOT fixed in this PR):

- HN jobs page produces near-empty extraction (0.1 KB vs Python trafilatura's 2.8 KB on same HTML). Classifier correctly returns Other with low confidence; the bug is in Stage 3 scoring on table-layout pages with high self-link density. Separate follow-up.
- Stripe /pricing still picks the wrong subtree (product nav vs pricing tables) even though the classifier now flags low confidence. Same Stage 3 issue.

Adds 2 unit tests: one for the Stripe-style failure (service URL + product-class noise) and one for the harmonic-scaling shape.
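The quoted values (1 → 1.0, 2 → 1.5, 5 → 2.28, 20 → 3.6) are the harmonic numbers. A minimal sketch of such a scaling function follows. The name `harmonic_class_weight` and its signature are assumptions made for illustration, not the PR's actual code.

```rust
/// Sub-linear weight for `matches` repeated class hits: the n-th harmonic
/// number H(n) = 1 + 1/2 + ... + 1/n.
/// Hypothetical helper for illustration; the PR's real function may differ.
fn harmonic_class_weight(matches: u32) -> f64 {
    // An empty range sums to 0.0, so zero matches contribute nothing.
    (1..=matches).map(|k| 1.0 / f64::from(k)).sum()
}
```

Under this sketch the first match contributes a full +1.0, the second +0.5, the third about +0.33, and so on, so a long run of identical class hints grows only logarithmically.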
Fixes a classifier failure mode where repeated class hits (e.g. a 20-card product grid on a `/pricing` page) accumulated to dominate the URL signal, causing pages like Stripe's pricing to mis-classify as Product even though `URL_SERVICE` matched `/pricing`. Three changes:

1. URL signal weight bumped from `3.0` to `5.0`. URL patterns are the site author's deliberate routing intent and historically the most reliable single signal.
2. Class signals now use harmonic (sub-linear) scaling, so each repeated match counts for less than the one before it.
3. Drop `topic` from `URL_FORUM`. `/topics/` is widely used as a docs section path (Django: `/en/5.0/topics/db/queries/`); this was mis-classifying docs as forums in the new weight regime.

Net behavior
On a 23-URL real-world spot check (Wikipedia / docs / Stack Overflow / GitHub / Stripe / Apple / etc.):

| Page | Before | After |
| --- | --- | --- |
| MDN Array docs | Listing | Documentation |
| Stripe `/pricing` | Product (conf 0.22) | Other (low conf fallback) |
| Django `/topics/db/queries/` | Documentation | Documentation (kept; regression averted by `topic` drop) |

All 54 golden-corpus fixtures still pass. 37 unit + integration + doctests pass. Clippy + fmt clean.
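The Django result can be illustrated with a small sketch of keyword-based URL matching. How `URL_FORUM` is really matched is not shown in this PR text, so the substring-per-segment rule, the constant `URL_FORUM_KEYWORDS`, and the function `url_looks_like_forum` are all assumptions; only the keyword lists come from the description.

```rust
/// Forum keywords after this PR; `topic` is intentionally absent.
/// The keywords come from the PR text; storing them in a constant with
/// this name is an assumption.
const URL_FORUM_KEYWORDS: [&str; 3] = ["thread", "discussion", "question"];

/// Returns true if any non-empty path segment contains one of `keywords`.
/// Substring matching is an assumed rule, chosen only for illustration.
fn url_looks_like_forum(path: &str, keywords: &[&str]) -> bool {
    path.split('/')
        .filter(|segment| !segment.is_empty())
        .any(|segment| {
            let segment = segment.to_ascii_lowercase();
            keywords.iter().any(|kw| segment.contains(*kw))
        })
}
```

With a keyword list that still includes `topic`, the path `/en/5.0/topics/db/queries/` matches through its `topics` segment. With the list above it does not, while a path such as `/questions/12345/some-title` still does.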
What this does NOT fix
Two known production failure cases remain because the bug is in Stage 3 scoring, not the classifier:
- `news.ycombinator.com/jobs` produces near-empty extraction (0.1 KB vs Python trafilatura's 2.8 KB on the same HTML). The classifier correctly returns `Other` with low confidence (the URL doesn't match any pattern), but Stage 3's scored walk on table-layout pages with high self-link density drops too aggressively. Separate follow-up.
- `stripe.com/pricing` still picks the wrong subtree (product nav menu vs the pricing tables) even though this PR now flags low confidence on it. Same Stage 3 issue: when the page is genuinely structurally weird, the classifier correctly reports that it doesn't know, but the scored walk still emits a wrong-but-confident-looking output.

Both are worth a follow-up PR that touches `scoring.rs` + `fallback.rs`.

Tests added
- `service_url_beats_repeated_product_class_noise`: exercises the Stripe failure mode (20× product-class divs + `/pricing` URL) and asserts the classifier returns `Service`.
- `harmonic_class_scaling`: covers the scaling shape (0 → 0, 1 → 1.0, 2 → 1.5, 20 → ≈3.6).
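The arithmetic behind the first test can be sketched with a toy two-way scorer. This is not the classifier's real code: the `PageType` and `ClassScaling` enums, `class_score`, and `classify` are hypothetical names, and the real classifier presumably weighs more signals than one URL hit against one class count.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum PageType {
    Service,
    Product,
}

/// How repeated class matches accumulate into a score.
#[derive(Clone, Copy)]
enum ClassScaling {
    /// Each match adds a flat +1.0 (the old behavior per the PR text).
    Flat,
    /// n matches add H(n) = 1 + 1/2 + ... + 1/n (the new behavior).
    Harmonic,
}

fn class_score(matches: u32, scaling: ClassScaling) -> f64 {
    match scaling {
        ClassScaling::Flat => f64::from(matches),
        ClassScaling::Harmonic => (1..=matches).map(|k| 1.0 / f64::from(k)).sum(),
    }
}

/// Toy decision: one service-URL hit versus product-class noise.
/// Ties go to Product, an arbitrary choice for this sketch.
fn classify(url_weight: f64, product_class_matches: u32, scaling: ClassScaling) -> PageType {
    let service = url_weight;
    let product = class_score(product_class_matches, scaling);
    if service > product {
        PageType::Service
    } else {
        PageType::Product
    }
}
```

In this toy model the old settings (`classify(3.0, 20, ClassScaling::Flat)`) pick `Product`, since 20.0 beats 3.0, and the new settings (`classify(5.0, 20, ClassScaling::Harmonic)`) pick `Service`, since 5.0 beats roughly 3.6. Harmonic scaling with the old 3.0 weight would still lose to 20 matches, which suggests why the weight bump and the scaling change ship together.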