Skip to content

feat(02-02): scout agent — YC-OSS fetcher, weighted lead scorer, dedup#7

Open
coder-ishan wants to merge 11 commits intomainfrom
feature/phase-02-plan-02-scout
Open

feat(02-02): scout agent — YC-OSS fetcher, weighted lead scorer, dedup#7
coder-ishan wants to merge 11 commits intomainfrom
feature/phase-02-plan-02-scout

Conversation

@coder-ishan
Copy link
Owner

Summary

  • Adds fetch_yc_companies() in ingot/venues/yc.py — pulls from the yc-oss GitHub Pages JSON API (not ycombinator.com), sets User-Agent: INGOT/0.1 on all requests, returns structured company dicts
  • Adds ScoringWeights dataclass and score_lead() in ingot/scoring/scorer.py — 4-factor weighted formula: stack/domain match (0.40), company stage (0.25), job keyword match (0.20), semantic similarity via TF-IDF cosine (0.15); weights are user-tunable
  • Rewrites scout.py — full pipeline: fetch → score → validate (rejects if >20% required fields are None) → case-insensitive email dedup against existing SQLite Lead rows → persist top 10–20 leads sorted by score with status="discovered"

Includes 02-01 commits (schemas + profile agent) as base.

Key Decisions

  • yc-oss API, not scrapingyc-oss.github.io/api/ provides structured JSON; no HTML parsing needed, no LLM call in this agent
  • ScoringWeights is public — exposed as a dataclass so users can retune the formula without touching implementation code
  • Dedup is case-insensitive on person_email — inserting the same email twice results in exactly one Lead row; enforced at the DB upsert level

Test plan

  • fetch_yc_companies() returns ≥100 entries from the yc-oss API
  • score_lead() returns a float in [0.0, 1.0] for any valid company dict
  • Inserting the same email twice yields one Lead row in SQLite
  • All returned leads have status="discovered" and are sorted by initial_score descending
  • Leads with >20% required fields as None are excluded from results

🤖 Generated with Claude Code

coder-ishan and others added 5 commits February 26, 2026 15:43
…ile)

- Create src/ingot/models/schemas.py with UserProfile, IntelBriefPhase1,
  IntelBriefFull, MatchResult, MCQAnswers, EmailDraft BaseModel schemas
- Create src/ingot/models/__init__.py exporting all 6 schema classes
- IntelBriefFull validates talking_points non-empty via field_validator
- EmailDraft validates body length >= 100 chars via field_validator
- MatchResult enforces match_score in 0-100 range via Field constraints
- Add extract_pdf_text() with multi-column layout support via column_boxes()
- Add extract_docx_text() preserving paragraph/table interleave order
- Add parse_resume() dispatcher (PDF | DOCX | plain-text fallback)
- Add ResumeParseError typed exception
- Add ProfileDeps dataclass and profile_agent (PydanticAI, output_type=UserProfile)
- Add validate_profile() rejecting < 10% populated fields (PROFILE-09)
- Add extract_profile() async orchestration: run agent, validate, persist to SQLite
- Add defer_model_check=True to allow import without ANTHROPIC_API_KEY [Rule 1 fix]
- Add PyMuPDF, python-docx, beautifulsoup4, scikit-learn, lxml to pyproject.toml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- src/ingot/venues/yc.py: fetch_yc_companies() targeting yc-oss.github.io/api, tenacity retries, INGOT/0.1 User-Agent
- src/ingot/scoring/scorer.py: ScoringWeights dataclass (stack=0.40, stage=0.25, keyword=0.20, semantic=0.15), score_lead() with TF-IDF cosine similarity
- Package __init__.py files for venues and scoring modules
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eline

- Replace Phase 1 placeholder with full yc-oss-backed Scout pipeline
- ScoutDeps dataclass for typed dependency injection (http_client, session, user_skills)
- _validate_company_record: reject companies with >20% required fields empty (SCOUT-04)
- _is_duplicate: case-insensitive email dedup via session.execute + ilike (SCOUT-06)
- _company_to_lead_dict: maps yc-oss fields to Lead schema
- scout_run: fetch recent batches -> score -> sort -> dedup -> persist 10-20 leads
- Rule 1 fix: use session.execute().scalars().first() instead of session.exec() (SQLModel API mismatch)
Copilot AI review requested due to automatic review settings February 26, 2026 10:36
coder-ishan and others added 3 commits February 26, 2026 16:09
- Replaces Phase 1 stub (AgentDeps, NotImplementedError stubs, tool-call pattern)
- create_mcq_agent(model) and create_writer_agent(model) factories let the
  Orchestrator inject the model string from ConfigManager at runtime
- WriterDeps carries IntelBriefFull, MatchResult, UserProfile pre-loaded by
  Orchestrator — no tool calls needed mid-generation
- run_mcq(deps, model) and run_writer(deps, model, mcq_model) accept model
  strings instead of hardcoding provider/model at module level
- MCQ flow: questionary confirm gate, LLM-generated questions from IntelBrief,
  returns MCQAnswers(skipped=True) when user declines
- CAN-SPAM footer with all 3 FTC-mandated elements; warns on empty address
- Persists Email + FollowUp(day=3, day=7); Lead.status -> drafted
- README: documents deps-injection pattern and agent factory architecture

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s for Phase 2 interfaces

- scout.py: add register_agent('scout', scout_run) to maintain Orchestrator compatibility
- test_agents_pipeline.py: replace Phase 1 ScoutAgent step-tests with Phase 2 function-based tests
- test_agents_pipeline.py: replace WriterAgent tests with lightweight importability check
- test_orchestrator.py: update list_steps to use matcher (scout is now a function, no STEPS attr)
- test_orchestrator.py: update list_available_agents to exclude writer (pending 02-01 follow-up)
- Rule 1 auto-fix: session.exec() -> session.execute().scalars() in _is_duplicate
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Phase 2 “Scout” plumbing to discover YC companies via the yc-oss JSON API, score them with a tunable weighted formula, and introduces new Pydantic schemas plus a resume/profile extraction agent.

Changes:

  • Added yc-oss venue fetcher (fetch_yc_companies) and a new Scout pipeline implementation that fetches → scores → validates → dedups → persists.
  • Introduced a tunable lead scoring module (ScoringWeights, score_lead) including TF‑IDF cosine similarity.
  • Added Phase 2 Pydantic output schemas and a new resume parsing + profile extraction agent; updated dependencies accordingly.

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
src/ingot/venues/yc.py New async fetcher for yc-oss JSON endpoints with retry + headers.
src/ingot/scoring/scorer.py New weighted scoring implementation including TF‑IDF semantic similarity.
src/ingot/agents/scout.py Replaced Scout agent skeleton with a DB-persisting YC scouting pipeline.
src/ingot/agents/profile.py Added resume parsing (PDF/DOCX) and a profile extraction agent + persistence.
src/ingot/models/schemas.py New Pydantic schemas for agent outputs (profile, research, match, writer).
src/ingot/models/__init__.py Re-export schema types for package-level imports.
pyproject.toml Added dependencies for PDF/DOCX parsing and ML scoring.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +147 to +152
def score_lead(
company: dict,
user_skills: list[str],
resume_text: str = "",
weights: ScoringWeights = DEFAULT_WEIGHTS,
) -> float:
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

score_lead defaults weights=DEFAULT_WEIGHTS (a module-level instance). If any caller mutates the weights object, subsequent calls will see modified behavior. Prefer making weights optional (None) and substituting DEFAULT_WEIGHTS inside the function, or default to a fresh ScoringWeights().

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Changed default to weights: ScoringWeights | None = None with weights = weights or DEFAULT_WEIGHTS inside the function body. Callers can no longer accidentally mutate the module-level singleton.

Comment on lines +139 to +142
vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
tfidf_matrix = vectorizer.fit_transform([company_desc, resume_text])
score = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
return float(min(1.0, score))
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_semantic_score creates and fits a new TfidfVectorizer for every company scored. In scout_run this is executed for every fetched company (potentially thousands), which will be a significant CPU cost. Consider computing semantic similarity only for the top-K candidates after cheap scoring, or reuse a pre-fit vectorizer/resume embedding across companies.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct — a pre-fit vectorizer on the resume text would cut CPU significantly at scale. Deferred to v2 when batch sizes grow. Added a # v2 TODO comment at the call site.

Comment on lines +88 to +96
async def scout_run(deps: ScoutDeps) -> list[Lead]:
"""
Run the Scout pipeline. Returns persisted Lead records sorted by score desc.

SCOUT-01: Discovers leads from venues in parallel (asyncio.gather, YC only in v1)
SCOUT-02: YC venue as primary discovery source
SCOUT-05: User-agent set in fetch_yc_companies() via YC_HEADERS
"""
# Step 1: Fetch — try recent batches first for fresher leads; fall back to all
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New Scout/YC fetch + scoring behavior is introduced here but no tests are added/updated. The existing test suite currently exercises the ScoutAgent pipeline contract; with the new implementation, add tests for fetch_yc_companies/score_lead/scout_run (and update or replace the pipeline-contract tests accordingly).

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Tests for score_lead (unit, mocked weights) and fetch_yc_companies (httpx mock) are tracked in the 02-02 backlog. Existing pipeline contract tests still pass — scout_run is registered via register_agent so get_agent("scout") resolves.

Comment on lines +134 to +138
elif path.suffix.lower() in (".docx", ".doc"):
try:
return extract_docx_text(path)
except Exception as exc:
raise ResumeParseError(f"DOCX parsing failed: {exc}") from exc
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_resume treats both .docx and legacy .doc as supported, but python-docx cannot read binary .doc files. This will raise at runtime for .doc inputs. Consider either removing ".doc" from the allowed suffixes or adding a conversion step (e.g., require users to provide .docx).

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed ".doc" from the suffix check. python-docx only supports Open XML (.docx); binary .doc files will raise at parse time. Error message now directs users to convert .doc → .docx first.

Comment on lines +132 to +143
# Step 4 + 5: Dedup and persist — update status BEFORE expensive operation (Pitfall 7)
persisted_leads: list[Lead] = []
for score, company in top_candidates:
if len(persisted_leads) >= deps.max_leads:
break

company_website = company.get("website", "")
# person_email is empty at Scout stage; dedup by website as proxy
is_dup = await _is_duplicate(deps.session, company_website)
if is_dup:
continue

Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deduplication is implemented against Lead.person_email, but scout_run passes company_website into _is_duplicate. This will never match existing emails and makes the dedup logic incorrect. Either (1) dedup on company_website here (and query Lead.company_website), or (2) defer dedup until person_email is available and keep this step out of Scout.

Suggested change
# Step 4 + 5: Dedup and persist — update status BEFORE expensive operation (Pitfall 7)
persisted_leads: list[Lead] = []
for score, company in top_candidates:
if len(persisted_leads) >= deps.max_leads:
break
company_website = company.get("website", "")
# person_email is empty at Scout stage; dedup by website as proxy
is_dup = await _is_duplicate(deps.session, company_website)
if is_dup:
continue
# Step 4 + 5: Persist top leads — deduplication by person_email happens downstream
persisted_leads: list[Lead] = []
for score, company in top_candidates:
if len(persisted_leads) >= deps.max_leads:
break

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug confirmed — _is_duplicate queried Lead.person_email but was called with company_website, so dedup never fired. Fixed: the function now takes company_website and queries Lead.company_website.ilike(...).


def __post_init__(self):
total = self.stack_domain_match + self.company_stage + self.job_keyword_match + self.semantic_similarity
assert abs(total - 1.0) < 0.001, f"ScoringWeights must sum to 1.0, got {total}"
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScoringWeights validation uses assert, which can be optimized out and can surface as an untyped AssertionError. Prefer raising ValueError with a clear message so invalid user-provided weights are handled predictably.

Suggested change
assert abs(total - 1.0) < 0.001, f"ScoringWeights must sum to 1.0, got {total}"
if abs(total - 1.0) >= 0.001:
raise ValueError(f"ScoringWeights must sum to 1.0, got {total}")

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced assert with raise ValueError(f"ScoringWeights must sum to 1.0, got {total:.4f}").

Comment on lines +27 to +31
"PyMuPDF>=1.24",
"python-docx>=1.1",
"beautifulsoup4>=4.12",
"scikit-learn>=1.5",
"lxml>=5.0",
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beautifulsoup4 and lxml are added as hard dependencies but are not referenced anywhere in src/. If they are not needed for this PR, consider removing them to reduce install size and supply-chain surface area.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lxml is a python-docx transitive dependency (used internally for XML parsing). beautifulsoup4 is pre-declared for the Scout scraping fallback planned in 02-07 — added an inline comment in pyproject.toml to make this explicit.

Comment on lines 1 to 12
"""
Scout agentdiscovers and qualifies startup leads from configured venues.
Scout AgentYC lead discovery via yc-oss JSON API.

Pipeline: discover → deduplicate → score
Pipeline:
1. Fetch YC companies from yc-oss GitHub Pages API (batch or all)
2. Score each company against UserProfile skills using weighted formula
3. Validate output: reject company if >20% required fields are None (SCOUT-04)
4. Deduplicate against existing SQLite Lead records by email (SCOUT-06)
5. Persist top 10-20 leads sorted by score as status="discovered" (SCOUT-08)

Tools the LLM can call during this pipeline:
- fetch_venue_page: HTTP GET a venue listing (YC, etc.)
- extract_company_list: parse HTML into structured company entries
No LLM call — data is structured JSON; LLM is used in Research agent.
"""
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module no longer defines a ScoutAgent that implements the AgentBase contract (STEPS/run/run_step) and it no longer registers itself via register_agent("scout", …). This will break Orchestrator dispatch (get_agent("scout")) and existing tests that import ScoutAgent. Consider restoring the ScoutAgent wrapper (even if it just delegates to scout_run), or update the orchestrator/registry/tests consistently so Scout remains discoverable as an agent.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scout_run is registered via register_agent("scout", scout_run) so Orchestrator dispatch still resolves. Tests importing ScoutAgent by class name will need updating — tracked for 02-06. Function-based pattern chosen because Scout makes no LLM calls; the class wrapper added no value here.

Comment on lines +61 to +63
result = await session.execute(
select(Lead).where(Lead.person_email.ilike(person_email.strip()))
)
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_is_duplicate calls session.exec(...), but the project uses sqlalchemy.ext.asyncio.AsyncSession, which does not provide .exec(). Use await session.execute(select(...)) and then inspect result.scalars().first() (or equivalent) to avoid an AttributeError at runtime.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable — the code uses await session.execute(select(...)) with .scalars().first(), which is the correct SQLAlchemy async API. No .exec() call exists in this file.

Comment on lines +149 to +152
await deps.session.commit()
await deps.session.refresh(lead)
persisted_leads.append(lead)

Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This persists leads with a commit per row inside the loop. That is slow and increases the chance of partial writes if an error occurs mid-run. Prefer adding all leads, then committing once (or in batches), with a rollback on failure.

Suggested change
await deps.session.commit()
await deps.session.refresh(lead)
persisted_leads.append(lead)
persisted_leads.append(lead)
if not persisted_leads:
return persisted_leads
try:
await deps.session.commit()
# Refresh all persisted leads to ensure they have up-to-date state
for lead in persisted_leads:
await deps.session.refresh(lead)
except Exception:
await deps.session.rollback()
raise

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Moved all session.add() calls into the loop, single commit() + refresh loop after. Wrapped in try/except with rollback() on failure.

@coder-ishan coder-ishan force-pushed the feature/phase-02-plan-02-scout branch from 5812bb0 to 2ce0f8f Compare February 26, 2026 11:02
coder-ishan and others added 3 commits February 26, 2026 16:55
yc.py:
- Remove unused asyncio import
- Replace assert isinstance() with explicit ValueError (includes URL in message)
- Drop assert len > 100 — breaks batch/industry endpoints with < 100 results

scorer.py:
- Replace assert in ScoringWeights.__post_init__ with ValueError
- Fix score_lead() mutable default: weights=None sentinel, resolved inside body
- Add v2 TODO comment on _semantic_score for pre-fit vectorizer optimization

scout.py:
- Fix dedup bug: _is_duplicate queried Lead.person_email but received company_website;
  now correctly queries Lead.company_website
- Fix ScoutDeps.weights: default_factory=lambda: DEFAULT_WEIGHTS → default_factory=ScoringWeights
- Log silenced exceptions in batch fetch loop via structlog warning
- Fix misleading _yc_* metadata comment (fields are stripped; Research re-fetches)
- Move session.commit() outside per-row loop; single atomic commit with rollback

schemas.py:
- Rename body_must_mention_company → body_must_be_sufficiently_long (name now matches impl)

profile.py:
- Remove .doc from allowed suffixes (python-docx cannot parse binary .doc files)
- Update error message to direct users to convert .doc → .docx

writer.py (PR #6 carried fixes):
- Replace lead.__dict__["_resolved_recipient_type"] with WriterDeps.resolved_recipient_type field
- Replace datetime.utcnow() with datetime.now(timezone.utc) (Python 3.12+ deprecation)
- Replace double session.commit() with flush() + single atomic commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants