Preserve actionable URLs through Gmail preprocessing and condensation #131

Open
constkolesnyak wants to merge 1 commit into ClickHouse:main from constkolesnyak:upstream/email-actionable-urls
@constkolesnyak

Contributor

Problem

The Gmail source's HTML→text preprocessing in nerve/sources/gmail.py ran a blanket URL-strip pass on every body — every standalone https://… was removed before the email reached the agent or the condenser. That kills the things you actually want to act on:

  • Booking confirmation links (Eventim, calendar invites, Doctolib, etc.)
  • Messaging deep links (chat replies, Discord/Slack notification links)
  • Payment links (PayPal/Stripe receipts → "Pay now")
  • Reply / unsubscribe links inside threaded conversations

What it should have stripped is the boilerplate: mailto:, tracking pixels, footer social links, and "view this in your browser" links. That is what the original blanket pattern was clearly aiming at, even though none of these is usually a problem for the agent.

Symptom from production: inbox notifications routinely arrived with the URL removed, forcing manual search of the Gmail web UI for any link the message referenced.

Fix

Replace the blanket URL strip with _is_boilerplate_url(url) heuristics. Boilerplate URLs get removed; everything else passes through unchanged:

  • Heuristic flags hosts/paths matching unsubscribe, manage-?(subscriptions|preferences), tracking pixel domains (Mailchimp click-tracking, etc.), and known social-footer hosts.
  • Anything not matching is preserved (the actionable path).

Also tighten the condense prompt in nerve/sources/runner.py to explicitly tell the condenser model to preserve actionable URLs through summarisation (otherwise a 4k-char condensation pass would strip them again on the LLM side).
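
As an illustration only, a prompt-side rule of this kind might look like the sketch below. The wording of `URL_RULE`, the `build_condense_prompt` helper, and its parameters are hypothetical; the actual prompt text in nerve/sources/runner.py is not reproduced in this description.

```python
# Hypothetical rule text; not the wording used in nerve/sources/runner.py.
URL_RULE = (
    "Preserve every actionable URL verbatim (booking, messaging, payment, "
    "and reply links). Do not shorten, paraphrase, or drop them."
)


def build_condense_prompt(body: str, max_chars: int = 4000) -> str:
    """Assemble a condense prompt that states the URL rule explicitly."""
    return (
        f"Condense the following email to at most {max_chars} characters.\n"
        f"{URL_RULE}\n\n"
        f"Email:\n{body}"
    )
```

Stating the rule in the prompt matters because preprocessing alone cannot protect a URL that the summarising model later decides to omit.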

Tests

The existing source-runner tests stay green (pytest -k 'gmail or sources or runner': 14 passed, 2 skipped). No new tests added — the heuristic is small and verified end-to-end by routine inbox processing.

Files

  • nerve/sources/gmail.py: _is_boilerplate_url + selective strip (+19 lines)
  • nerve/sources/runner.py: condense prompt, keep actionable URLs (+3 lines)

Generated with Claude Code

Gmail preprocessing was stripping all standalone URLs indiscriminately.
Now only removes boilerplate (unsubscribe, social media, tracking pixels)
and keeps actionable links (booking, messaging, payment, reply threads).
Condense prompts updated to explicitly preserve actionable URLs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>