feat(02-02): scout agent — YC-OSS fetcher, weighted lead scorer, dedup #7
coder-ishan wants to merge 11 commits into main from
Conversation
…ile)
- Create src/ingot/models/schemas.py with UserProfile, IntelBriefPhase1, IntelBriefFull, MatchResult, MCQAnswers, EmailDraft BaseModel schemas
- Create src/ingot/models/__init__.py exporting all 6 schema classes
- IntelBriefFull validates talking_points non-empty via field_validator
- EmailDraft validates body length >= 100 chars via field_validator
- MatchResult enforces match_score in 0-100 range via Field constraints
- Add extract_pdf_text() with multi-column layout support via column_boxes()
- Add extract_docx_text() preserving paragraph/table interleave order
- Add parse_resume() dispatcher (PDF | DOCX | plain-text fallback)
- Add ResumeParseError typed exception
- Add ProfileDeps dataclass and profile_agent (PydanticAI, output_type=UserProfile)
- Add validate_profile() rejecting < 10% populated fields (PROFILE-09)
- Add extract_profile() async orchestration: run agent, validate, persist to SQLite
- Add defer_model_check=True to allow import without ANTHROPIC_API_KEY [Rule 1 fix]
- Add PyMuPDF, python-docx, beautifulsoup4, scikit-learn, lxml to pyproject.toml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- src/ingot/venues/yc.py: fetch_yc_companies() targeting yc-oss.github.io/api, tenacity retries, INGOT/0.1 User-Agent
- src/ingot/scoring/scorer.py: ScoringWeights dataclass (stack=0.40, stage=0.25, keyword=0.20, semantic=0.15), score_lead() with TF-IDF cosine similarity
- Package __init__.py files for venues and scoring modules
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eline
- Replace Phase 1 placeholder with full yc-oss-backed Scout pipeline
- ScoutDeps dataclass for typed dependency injection (http_client, session, user_skills)
- _validate_company_record: reject companies with >20% required fields empty (SCOUT-04)
- _is_duplicate: case-insensitive email dedup via session.execute + ilike (SCOUT-06)
- _company_to_lead_dict: maps yc-oss fields to Lead schema
- scout_run: fetch recent batches -> score -> sort -> dedup -> persist 10-20 leads
- Rule 1 fix: use session.execute().scalars().first() instead of session.exec() (SQLModel API mismatch)
- Replaces Phase 1 stub (AgentDeps, NotImplementedError stubs, tool-call pattern)
- create_mcq_agent(model) and create_writer_agent(model) factories let the Orchestrator inject the model string from ConfigManager at runtime
- WriterDeps carries IntelBriefFull, MatchResult, UserProfile pre-loaded by Orchestrator — no tool calls needed mid-generation
- run_mcq(deps, model) and run_writer(deps, model, mcq_model) accept model strings instead of hardcoding provider/model at module level
- MCQ flow: questionary confirm gate, LLM-generated questions from IntelBrief, returns MCQAnswers(skipped=True) when user declines
- CAN-SPAM footer with all 3 FTC-mandated elements; warns on empty address
- Persists Email + FollowUp(day=3, day=7); Lead.status -> drafted
- README: documents deps-injection pattern and agent factory architecture

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s for Phase 2 interfaces
- scout.py: add register_agent('scout', scout_run) to maintain Orchestrator compatibility
- test_agents_pipeline.py: replace Phase 1 ScoutAgent step-tests with Phase 2 function-based tests
- test_agents_pipeline.py: replace WriterAgent tests with lightweight importability check
- test_orchestrator.py: update list_steps to use matcher (scout is now a function, no STEPS attr)
- test_orchestrator.py: update list_available_agents to exclude writer (pending 02-01 follow-up)
- Rule 1 auto-fix: session.exec() -> session.execute().scalars() in _is_duplicate
Pull request overview
Adds Phase 2 “Scout” plumbing to discover YC companies via the yc-oss JSON API and score them with a tunable weighted formula; also introduces new Pydantic schemas plus a resume/profile extraction agent.
Changes:
- Added yc-oss venue fetcher (fetch_yc_companies) and a new Scout pipeline implementation that fetches → scores → validates → dedups → persists.
- Introduced a tunable lead scoring module (ScoringWeights, score_lead) including TF-IDF cosine similarity.
- Added Phase 2 Pydantic output schemas and a new resume parsing + profile extraction agent; updated dependencies accordingly.
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| src/ingot/venues/yc.py | New async fetcher for yc-oss JSON endpoints with retry + headers. |
| src/ingot/scoring/scorer.py | New weighted scoring implementation including TF-IDF semantic similarity. |
| src/ingot/agents/scout.py | Replaced Scout agent skeleton with a DB-persisting YC scouting pipeline. |
| src/ingot/agents/profile.py | Added resume parsing (PDF/DOCX) and a profile extraction agent + persistence. |
| src/ingot/models/schemas.py | New Pydantic schemas for agent outputs (profile, research, match, writer). |
| src/ingot/models/__init__.py | Re-export schema types for package-level imports. |
| pyproject.toml | Added dependencies for PDF/DOCX parsing and ML scoring. |
```python
def score_lead(
    company: dict,
    user_skills: list[str],
    resume_text: str = "",
    weights: ScoringWeights = DEFAULT_WEIGHTS,
) -> float:
```
score_lead defaults weights=DEFAULT_WEIGHTS (a module-level instance). If any caller mutates the weights object, subsequent calls will see modified behavior. Prefer making weights optional (None) and substituting DEFAULT_WEIGHTS inside the function, or default to a fresh ScoringWeights().
Fixed. Changed default to weights: ScoringWeights | None = None with weights = weights or DEFAULT_WEIGHTS inside the function body. Callers can no longer accidentally mutate the module-level singleton.
```python
vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
tfidf_matrix = vectorizer.fit_transform([company_desc, resume_text])
score = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
return float(min(1.0, score))
```
_semantic_score creates and fits a new TfidfVectorizer for every company scored. In scout_run this is executed for every fetched company (potentially thousands), which will be a significant CPU cost. Consider computing semantic similarity only for the top-K candidates after cheap scoring, or reuse a pre-fit vectorizer/resume embedding across companies.
Correct — a pre-fit vectorizer on the resume text would cut CPU significantly at scale. Deferred to v2 when batch sizes grow. Added a # v2 TODO comment at the call site.
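A minimal sketch of the deferred optimization, assuming scikit-learn and a placeholder resume string: fit the TF-IDF vocabulary once on the resume, then only transform per company.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

resume_text = "python asyncio sqlalchemy backend apis"  # placeholder resume text

# Fit the vocabulary on the resume exactly once; reuse for every company
vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
resume_vec = vectorizer.fit_transform([resume_text])

def semantic_score(company_desc: str) -> float:
    company_vec = vectorizer.transform([company_desc])  # no refit per company
    return float(min(1.0, cosine_similarity(company_vec, resume_vec)[0][0]))
```

Terms outside the resume vocabulary vectorize to zero, so scores stay in [0, 1] and unrelated descriptions score 0.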
```python
async def scout_run(deps: ScoutDeps) -> list[Lead]:
    """
    Run the Scout pipeline. Returns persisted Lead records sorted by score desc.

    SCOUT-01: Discovers leads from venues in parallel (asyncio.gather, YC only in v1)
    SCOUT-02: YC venue as primary discovery source
    SCOUT-05: User-agent set in fetch_yc_companies() via YC_HEADERS
    """
    # Step 1: Fetch — try recent batches first for fresher leads; fall back to all
```
New Scout/YC fetch + scoring behavior is introduced here but no tests are added/updated. The existing test suite currently exercises the ScoutAgent pipeline contract; with the new implementation, add tests for fetch_yc_companies/score_lead/scout_run (and update or replace the pipeline-contract tests accordingly).
Acknowledged. Tests for score_lead (unit, mocked weights) and fetch_yc_companies (httpx mock) are tracked in the 02-02 backlog. Existing pipeline contract tests still pass — scout_run is registered via register_agent so get_agent("scout") resolves.
src/ingot/agents/profile.py (outdated)
```python
elif path.suffix.lower() in (".docx", ".doc"):
    try:
        return extract_docx_text(path)
    except Exception as exc:
        raise ResumeParseError(f"DOCX parsing failed: {exc}") from exc
```
parse_resume treats both .docx and legacy .doc as supported, but python-docx cannot read binary .doc files. This will raise at runtime for .doc inputs. Consider either removing ".doc" from the allowed suffixes or adding a conversion step (e.g., require users to provide .docx).
Fixed. Removed ".doc" from the suffix check. python-docx only supports Open XML (.docx); binary .doc files will raise at parse time. Error message now directs users to convert .doc → .docx first.
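The tightened dispatch can be sketched as follows — check_suffix is a hypothetical helper, not the PR's actual parse_resume, shown only to illustrate the .docx-only policy:

```python
from pathlib import Path

class ResumeParseError(Exception):
    pass

def check_suffix(path: Path) -> None:
    # Hypothetical sketch: accept .docx only; reject legacy .doc with guidance
    suffix = path.suffix.lower()
    if suffix == ".doc":
        raise ResumeParseError(
            "Legacy .doc is not supported by python-docx; convert to .docx first"
        )
    if suffix != ".docx":
        raise ResumeParseError(f"Unsupported resume format: {suffix}")
```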
```python
# Step 4 + 5: Dedup and persist — update status BEFORE expensive operation (Pitfall 7)
persisted_leads: list[Lead] = []
for score, company in top_candidates:
    if len(persisted_leads) >= deps.max_leads:
        break

    company_website = company.get("website", "")
    # person_email is empty at Scout stage; dedup by website as proxy
    is_dup = await _is_duplicate(deps.session, company_website)
    if is_dup:
        continue
```
Deduplication is implemented against Lead.person_email, but scout_run passes company_website into _is_duplicate. This will never match existing emails and makes the dedup logic incorrect. Either (1) dedup on company_website here (and query Lead.company_website), or (2) defer dedup until person_email is available and keep this step out of Scout.
```diff
-# Step 4 + 5: Dedup and persist — update status BEFORE expensive operation (Pitfall 7)
+# Step 4 + 5: Persist top leads — deduplication by person_email happens downstream
 persisted_leads: list[Lead] = []
 for score, company in top_candidates:
     if len(persisted_leads) >= deps.max_leads:
         break
-
-    company_website = company.get("website", "")
-    # person_email is empty at Scout stage; dedup by website as proxy
-    is_dup = await _is_duplicate(deps.session, company_website)
-    if is_dup:
-        continue
```
Bug confirmed — _is_duplicate queried Lead.person_email but was called with company_website, so dedup never fired. Fixed: the function now takes company_website and queries Lead.company_website.ilike(...).
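For illustration, the same case-insensitive website dedup expressed with stdlib sqlite3 — the PR itself uses an async SQLAlchemy session with ilike(), so names and SQL here are a simplified stand-in:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lead (company_website TEXT)")
con.execute("INSERT INTO lead VALUES ('https://Acme.dev')")

def is_duplicate(con: sqlite3.Connection, company_website: str) -> bool:
    # LOWER(...) = LOWER(?) mirrors ilike() without wildcards: case-insensitive equality
    row = con.execute(
        "SELECT 1 FROM lead WHERE LOWER(company_website) = LOWER(?)",
        (company_website.strip(),),
    ).fetchone()
    return row is not None
```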
src/ingot/scoring/scorer.py (outdated)
```python
def __post_init__(self):
    total = self.stack_domain_match + self.company_stage + self.job_keyword_match + self.semantic_similarity
    assert abs(total - 1.0) < 0.001, f"ScoringWeights must sum to 1.0, got {total}"
```
ScoringWeights validation uses assert, which can be optimized out and can surface as an untyped AssertionError. Prefer raising ValueError with a clear message so invalid user-provided weights are handled predictably.
```diff
-        assert abs(total - 1.0) < 0.001, f"ScoringWeights must sum to 1.0, got {total}"
+        if abs(total - 1.0) >= 0.001:
+            raise ValueError(f"ScoringWeights must sum to 1.0, got {total}")
```
Fixed. Replaced assert with raise ValueError(f"ScoringWeights must sum to 1.0, got {total:.4f}").
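A self-contained sketch of the validated dataclass — field names and the 0.40/0.25/0.20/0.15 defaults are taken from this PR's description; exact field order in the real module may differ:

```python
from dataclasses import dataclass

@dataclass
class ScoringWeights:
    stack_domain_match: float = 0.40
    company_stage: float = 0.25
    job_keyword_match: float = 0.20
    semantic_similarity: float = 0.15

    def __post_init__(self) -> None:
        total = (self.stack_domain_match + self.company_stage
                 + self.job_keyword_match + self.semantic_similarity)
        # ValueError survives python -O, unlike assert, and is a typed, catchable error
        if abs(total - 1.0) >= 0.001:
            raise ValueError(f"ScoringWeights must sum to 1.0, got {total:.4f}")
```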
```toml
"PyMuPDF>=1.24",
"python-docx>=1.1",
"beautifulsoup4>=4.12",
"scikit-learn>=1.5",
"lxml>=5.0",
```
beautifulsoup4 and lxml are added as hard dependencies but are not referenced anywhere in src/. If they are not needed for this PR, consider removing them to reduce install size and supply-chain surface area.
lxml is a python-docx transitive dependency (used internally for XML parsing). beautifulsoup4 is pre-declared for the Scout scraping fallback planned in 02-07 — added an inline comment in pyproject.toml to make this explicit.
```diff
 """
-Scout agent — discovers and qualifies startup leads from configured venues.
+Scout Agent — YC lead discovery via yc-oss JSON API.

-Pipeline: discover → deduplicate → score
+Pipeline:
+1. Fetch YC companies from yc-oss GitHub Pages API (batch or all)
+2. Score each company against UserProfile skills using weighted formula
+3. Validate output: reject company if >20% required fields are None (SCOUT-04)
+4. Deduplicate against existing SQLite Lead records by email (SCOUT-06)
+5. Persist top 10-20 leads sorted by score as status="discovered" (SCOUT-08)

-Tools the LLM can call during this pipeline:
-- fetch_venue_page: HTTP GET a venue listing (YC, etc.)
-- extract_company_list: parse HTML into structured company entries
+No LLM call — data is structured JSON; LLM is used in Research agent.
 """
```
This module no longer defines a ScoutAgent that implements the AgentBase contract (STEPS/run/run_step) and it no longer registers itself via register_agent("scout", …). This will break Orchestrator dispatch (get_agent("scout")) and existing tests that import ScoutAgent. Consider restoring the ScoutAgent wrapper (even if it just delegates to scout_run), or update the orchestrator/registry/tests consistently so Scout remains discoverable as an agent.
scout_run is registered via register_agent("scout", scout_run) so Orchestrator dispatch still resolves. Tests importing ScoutAgent by class name will need updating — tracked for 02-06. Function-based pattern chosen because Scout makes no LLM calls; the class wrapper added no value here.
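The registry pattern described here can be sketched minimally — the function names mirror the PR, but the registry internals and the empty scout_run body are illustrative placeholders:

```python
from typing import Any, Callable

# Hypothetical minimal registry mirroring the register_agent / get_agent pattern
_REGISTRY: dict[str, Callable[..., Any]] = {}

def register_agent(name: str, runner: Callable[..., Any]) -> None:
    _REGISTRY[name] = runner

def get_agent(name: str) -> Callable[..., Any]:
    return _REGISTRY[name]

async def scout_run(deps: Any) -> list:
    return []  # placeholder body

register_agent("scout", scout_run)
```

Because the Orchestrator resolves agents by name, it dispatches plain async functions and class-based agents identically.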
```python
result = await session.execute(
    select(Lead).where(Lead.person_email.ilike(person_email.strip()))
)
```
_is_duplicate calls session.exec(...), but the project uses sqlalchemy.ext.asyncio.AsyncSession, which does not provide .exec(). Use await session.execute(select(...)) and then inspect result.scalars().first() (or equivalent) to avoid an AttributeError at runtime.
Not applicable — the code uses await session.execute(select(...)) with .scalars().first(), which is the correct SQLAlchemy async API. No .exec() call exists in this file.
src/ingot/agents/scout.py (outdated)
```python
await deps.session.commit()
await deps.session.refresh(lead)
persisted_leads.append(lead)
```
This persists leads with a commit per row inside the loop. That is slow and increases the chance of partial writes if an error occurs mid-run. Prefer adding all leads, then committing once (or in batches), with a rollback on failure.
```diff
-        await deps.session.commit()
-        await deps.session.refresh(lead)
         persisted_leads.append(lead)
+    if not persisted_leads:
+        return persisted_leads
+    try:
+        await deps.session.commit()
+        # Refresh all persisted leads to ensure they have up-to-date state
+        for lead in persisted_leads:
+            await deps.session.refresh(lead)
+    except Exception:
+        await deps.session.rollback()
+        raise
```
Fixed. Moved all session.add() calls into the loop, single commit() + refresh loop after. Wrapped in try/except with rollback() on failure.
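The single-commit pattern, illustrated with stdlib sqlite3 rather than the project's async SQLAlchemy session — the shape (batch the writes, commit once, roll back on failure) is the same:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lead (website TEXT)")

rows = [("https://a.dev",), ("https://b.dev",)]
try:
    con.executemany("INSERT INTO lead VALUES (?)", rows)
    con.commit()  # one commit for the whole batch, not one per row
except Exception:
    con.rollback()  # no partial writes survive a mid-batch failure
    raise
```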
Force-pushed 5812bb0 to 2ce0f8f
yc.py:
- Remove unused asyncio import
- Replace assert isinstance() with explicit ValueError (includes URL in message)
- Drop assert len > 100 — breaks batch/industry endpoints with < 100 results

scorer.py:
- Replace assert in ScoringWeights.__post_init__ with ValueError
- Fix score_lead() mutable default: weights=None sentinel, resolved inside body
- Add v2 TODO comment on _semantic_score for pre-fit vectorizer optimization

scout.py:
- Fix dedup bug: _is_duplicate queried Lead.person_email but received company_website; now correctly queries Lead.company_website
- Fix ScoutDeps.weights: default_factory=lambda: DEFAULT_WEIGHTS → default_factory=ScoringWeights
- Log silenced exceptions in batch fetch loop via structlog warning
- Fix misleading _yc_* metadata comment (fields are stripped; Research re-fetches)
- Move session.commit() outside per-row loop; single atomic commit with rollback

schemas.py:
- Rename body_must_mention_company → body_must_be_sufficiently_long (name now matches impl)

profile.py:
- Remove .doc from allowed suffixes (python-docx cannot parse binary .doc files)
- Update error message to direct users to convert .doc → .docx

writer.py (PR #6 carried fixes):
- Replace lead.__dict__["_resolved_recipient_type"] with WriterDeps.resolved_recipient_type field
- Replace datetime.utcnow() with datetime.now(timezone.utc) (Python 3.12+ deprecation)
- Replace double session.commit() with flush() + single atomic commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- fetch_yc_companies() in ingot/venues/yc.py — pulls from the yc-oss GitHub Pages JSON API (not ycombinator.com), sets User-Agent: INGOT/0.1 on all requests, returns structured company dicts
- ScoringWeights dataclass and score_lead() in ingot/scoring/scorer.py — 4-factor weighted formula: stack/domain match (0.40), company stage (0.25), job keyword match (0.20), semantic similarity via TF-IDF cosine (0.15); weights are user-tunable
- scout.py — full pipeline: fetch → score → validate (rejects if >20% required fields are None) → case-insensitive email dedup against existing SQLite Lead rows → persist top 10–20 leads sorted by score with status="discovered"

Includes 02-01 commits (schemas + profile agent) as base.
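The 4-factor weighted sum can be illustrated with made-up factor values — the real score_lead derives each factor from the company dict and user skills, so only the weights below come from this PR:

```python
WEIGHTS = {
    "stack_domain_match": 0.40,
    "company_stage": 0.25,
    "job_keyword_match": 0.20,
    "semantic_similarity": 0.15,
}

def weighted_score(factors: dict[str, float]) -> float:
    # Each factor is assumed normalized to [0, 1]; the weighted sum then stays in [0, 1]
    return sum(WEIGHTS[name] * value for name, value in factors.items())

score = weighted_score({
    "stack_domain_match": 1.0,   # e.g. company stack fully overlaps user skills
    "company_stage": 0.5,
    "job_keyword_match": 0.25,
    "semantic_similarity": 0.2,
})
# 0.40*1.0 + 0.25*0.5 + 0.20*0.25 + 0.15*0.2 = 0.605
```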
Key Decisions
- yc-oss.github.io/api/ provides structured JSON; no HTML parsing needed, no LLM call in this agent
- ScoringWeights is public — exposed as a dataclass so users can retune the formula without touching implementation code
- person_email — inserting the same email twice results in exactly one Lead row; enforced at the DB upsert level

Test plan
- fetch_yc_companies() returns ≥100 entries from the yc-oss API
- score_lead() returns a float in [0.0, 1.0] for any valid company dict
- Each persisted lead creates a Lead row in SQLite
- Persisted leads have status="discovered" and are sorted by initial_score descending
- Companies with >20% required fields None are excluded from results

🤖 Generated with Claude Code