classifier: stronger URL signal + harmonic class scaling #1

Merged: abimaelmartell merged 1 commit into main from fix/classifier-url-signal-priority on May 20, 2026
@abimaelmartell
Member

Fixes a classifier failure mode where repeated class hits (e.g. a 20-card product grid on a /pricing page) accumulated to dominate the URL signal, causing pages like Stripe's pricing to mis-classify as Product even though URL_SERVICE matched /pricing. Three changes:

  1. URL signal weight bumped from 3.0 to 5.0. URL patterns are the site author's deliberate routing intent and historically the most reliable single signal.
  2. Class signals now use harmonic (sub-linear) scaling. First match is full-weight evidence; each repeat counts for less (1 match → +1.0, 5 → +2.28, 20 → +3.6). Stops repeated component-class noise from drowning stronger signals.
  3. Drop topic from URL_FORUM. /topics/ is widely used as a docs section path (Django: /en/5.0/topics/db/queries/) — this was mis-classifying docs as forums in the new weight regime.
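The scaling figures quoted in change 2 match the harmonic numbers H(n) = 1 + 1/2 + ... + 1/n. A minimal sketch of that shape follows, assuming the k-th repeat of a class hint contributes 1/k; the helper name `harmonic_class_score` is hypothetical and may not match the function in the classifier.

```rust
/// Sub-linear class-signal score: the n-th harmonic number
/// H(n) = 1 + 1/2 + ... + 1/n, where `matches` is the number of times
/// a class hint fired on the page.
/// Hypothetical helper name; the real classifier may structure this differently.
pub fn harmonic_class_score(matches: usize) -> f64 {
    // An empty range (matches == 0) sums to 0.0.
    (1..=matches).map(|k| 1.0 / k as f64).sum()
}
```

Under this assumption, 1 match scores 1.0, 5 matches score about 2.28, and 20 matches score about 3.60, so a 20-card grid contributes less than a single URL hit at weight 5.0.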

Net behavior

On a 23-URL real-world spot check (Wikipedia / docs / Stack Overflow / GitHub / Stripe / Apple / etc.):

| URL | before | after | direction |
|---|---|---|---|
| MDN Array docs | Listing | Documentation | ✅ correct |
| Stripe /pricing | Product (conf 0.22) | Other (low conf fallback) | ✅ no longer falsely confident in wrong type |
| Django docs /topics/db/queries/ | Documentation | Documentation (kept; regression averted by topic drop) | ✅ neutral |
| Other 20 URLs | unchanged | unchanged | |

All 54 golden-corpus fixtures still pass. 37 unit + integration + doctests pass. Clippy + fmt clean.

What this does NOT fix

Two known production failure cases remain because the bug is in Stage 3 scoring, not the classifier:

  • news.ycombinator.com/jobs produces near-empty extraction (0.1 KB vs Python trafilatura's 2.8 KB on the same HTML). Classifier correctly returns Other with low confidence (URL doesn't match any pattern), but Stage 3's scored walk on table-layout pages with high self-link density drops too aggressively. Separate follow-up.
  • stripe.com/pricing still picks the wrong subtree (product nav menu vs the pricing tables) even though this PR now flags low confidence on it. Same Stage 3 issue — when the page is genuinely structurally weird, classifier guesses correctly that it doesn't know, but the scored walk still emits a wrong-but-confident-looking output.

Both are worth a follow-up PR that touches scoring.rs + fallback.rs.

Tests added

  • service_url_beats_repeated_product_class_noise — exercises the Stripe failure mode (20× product-class divs + /pricing URL) and asserts the classifier returns Service.
  • harmonic_class_scaling — covers the scaling shape (0 → 0, 1 → 1.0, 2 → 1.5, 20 → ≈3.6).
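The arithmetic behind the first test can be sketched as a toy two-way decision between one URL_SERVICE hit and a pile of product-class matches. Everything here (`PageType`, `ClassScaling`, `class_score`, `classify`) is a hypothetical stand-in for illustration; the real classifier weighs more signals and more page types.

```rust
#[derive(Debug, PartialEq)]
enum PageType {
    Service,
    Product,
}

/// How repeated class matches accumulate (hypothetical enum for this sketch).
enum ClassScaling {
    /// Pre-PR behavior: every match adds a flat +1.0.
    Flat,
    /// Post-PR behavior: the k-th match adds 1/k.
    Harmonic,
}

fn class_score(matches: usize, scaling: &ClassScaling) -> f64 {
    match scaling {
        ClassScaling::Flat => matches as f64,
        ClassScaling::Harmonic => (1..=matches).map(|k| 1.0 / k as f64).sum::<f64>(),
    }
}

/// Toy decision: a single URL hit (worth `url_weight`) votes Service,
/// `product_class_hits` class matches vote Product; the higher score wins.
fn classify(url_weight: f64, product_class_hits: usize, scaling: &ClassScaling) -> PageType {
    let service = url_weight;
    let product = class_score(product_class_hits, scaling);
    if service > product {
        PageType::Service
    } else {
        PageType::Product
    }
}
```

In this toy model, `classify(3.0, 20, &ClassScaling::Flat)` picks Product (3.0 vs 20.0), while `classify(5.0, 20, &ClassScaling::Harmonic)` picks Service (5.0 vs about 3.6). Harmonic scaling alone would not be enough at the old weight, since 3.0 is still below 3.6.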

Three changes to the page-type classifier:

1. URL signal weight bumped 3.0 → 5.0. URL patterns (/pricing, /docs,
   /products, /article) are deliberate routing decisions by the site
   author and historically the most reliable single signal. Bumping
   them keeps URL hits dominant when class-hint noise accumulates.

2. Class signals now use harmonic (sub-linear) scaling: 1 match → +1.0,
   2 → +1.5, 5 → +2.28, 20 → +3.6. Previously each match added a flat
   +1.0, which let a 20-card product-grid drown out URL_SERVICE on
   /pricing pages (Stripe-style). First match is full-weight evidence;
   each repeat counts for less.

3. Drop `topic` from URL_FORUM. `/topics/` is widely used as a docs
   section path (Django: /en/5.0/topics/db/queries/) and mis-classifies
   docs as forums. `thread`/`discussion`/`question` are the
   unambiguous forum signals.
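A rough sketch of the effect of the `topic` drop, assuming URL_FORUM behaves like a keyword match on path segments. The real pattern may well be a regex with different rules; `FORUM_SEGMENTS` and `looks_like_forum` are hypothetical names used only for illustration.

```rust
/// Hypothetical stand-in for URL_FORUM after this change: `topic` is gone.
const FORUM_SEGMENTS: [&str; 3] = ["thread", "discussion", "question"];

/// Returns true when any path segment equals a forum keyword or its
/// simple plural (keyword + "s"). Matching is case-insensitive.
fn looks_like_forum(path: &str) -> bool {
    path.split('/')
        .filter(|seg| !seg.is_empty())
        .any(|seg| {
            let seg = seg.to_ascii_lowercase();
            FORUM_SEGMENTS
                .iter()
                .any(|kw| seg == *kw || seg.strip_suffix('s') == Some(*kw))
        })
}
```

With this sketch, `/en/5.0/topics/db/queries/` no longer reads as a forum URL, while paths such as `/questions/12345/how-to` or `/forum/thread/42` still do.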

Net behavior change on 23-URL real-world spot check:
- MDN Array docs: Listing → Documentation (correct)
- Stripe /pricing: Product → Other (was wrongly Product; now safely uncertain)
- Django docs: Documentation → kept as Documentation (regression averted by
  the `topic` drop)
- All other 20 URLs: classification unchanged
- Golden corpus: all 54 fixtures still pass

Known limitations (NOT fixed in this PR):
- HN jobs page produces near-empty extraction (0.1 KB vs Python
  trafilatura's 2.8 KB on same HTML). Classifier correctly returns
  Other with low confidence; the bug is in Stage 3 scoring on table-
  layout pages with high self-link density. Separate follow-up.
- Stripe /pricing still picks the wrong subtree (product nav vs pricing
  tables) even though the classifier now flags low confidence. Same
  Stage 3 issue.

Adds 2 unit tests: one for the Stripe-style failure (service URL +
product-class noise) and one for the harmonic-scaling shape.
@abimaelmartell abimaelmartell merged commit 63c17e1 into main May 20, 2026
6 checks passed