feat(02-05): email writer agent — MCQ, A/B subjects, tone routing, CAN-SPAM#6
feat(02-05): email writer agent — MCQ, A/B subjects, tone routing, CAN-SPAM#6coder-ishan wants to merge 6 commits intomainfrom
Conversation
…ile) - Create src/ingot/models/schemas.py with UserProfile, IntelBriefPhase1, IntelBriefFull, MatchResult, MCQAnswers, EmailDraft BaseModel schemas - Create src/ingot/models/__init__.py exporting all 6 schema classes - IntelBriefFull validates talking_points non-empty via field_validator - EmailDraft validates body length >= 100 chars via field_validator - MatchResult enforces match_score in 0-100 range via Field constraints
- Add extract_pdf_text() with multi-column layout support via column_boxes() - Add extract_docx_text() preserving paragraph/table interleave order - Add parse_resume() dispatcher (PDF | DOCX | plain-text fallback) - Add ResumeParseError typed exception - Add ProfileDeps dataclass and profile_agent (PydanticAI, output_type=UserProfile) - Add validate_profile() rejecting < 10% populated fields (PROFILE-09) - Add extract_profile() async orchestration: run agent, validate, persist to SQLite - Add defer_model_check=True to allow import without ANTHROPIC_API_KEY [Rule 1 fix] - Add PyMuPDF, python-docx, beautifulsoup4, scikit-learn, lxml to pyproject.toml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ent Task 1 - WriterDeps dataclass with sender/address fields for CAN-SPAM - mcq_agent (Haiku) generates 2-3 context-specific questions from IntelBriefFull - run_mcq() optional flow with questionary.confirm() skip gate - build_can_spam_footer() with all 3 FTC-mandated elements; warns on empty address - _TONE_PROMPTS dict for hr/cto/ceo/default with meaningfully different length+structure rules - Both agents use defer_model_check=True (codebase convention) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ence — Task 2 - writer_agent (Sonnet) with output_type=EmailDraft and inject_writer_context system prompt - Tone detection from person_role: hr/cto/ceo/default routing - run_writer() persists Email + FollowUp(day=3) + FollowUp(day=7); Lead.status -> drafted - retrigger_mcq=True path for WRITER-13 reject/regenerate flow - CAN-SPAM footer built inline and appended to email body before persistence Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR implements the email writer agent with MCQ (Multiple Choice Questions) personalization flow, tone-based email generation, CAN-SPAM compliance, and database persistence. It includes both the writer pipeline (02-05) and the profile agent from plan 02-01 as a foundation.
Changes:
- Adds complete profile agent with PDF/DOCX parsing and PydanticAI-based resume extraction
- Implements two-agent writer pipeline: mcq_agent (Haiku) for question generation and writer_agent (Sonnet) for email drafting
- Implements tone-aware email generation with role-based routing (HR/CTO/CEO/default)
- Adds CAN-SPAM compliant footer with sender identity, physical address, and unsubscribe mechanism
- Creates Pydantic schemas for all Phase 2 agent outputs (UserProfile, IntelBriefFull, EmailDraft, etc.)
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 23 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ingot/models/schemas.py | Defines Pydantic output schemas for all Phase 2 agents (UserProfile, IntelBriefFull, MatchResult, MCQAnswers, EmailDraft) |
| src/ingot/models/init.py | Exports all schema classes from the models package |
| src/ingot/agents/writer.py | Complete rewrite implementing MCQ agent, writer agent, tone routing, CAN-SPAM footer, and database persistence |
| src/ingot/agents/profile.py | New module with PDF/DOCX parsing, PydanticAI profile extraction, validation, and persistence |
| pyproject.toml | Adds dependencies for document parsing (PyMuPDF, python-docx) and ML/scraping (beautifulsoup4, scikit-learn, lxml) |
| .planning/phases/02-core-pipeline-scout-through-writer/02-05-SUMMARY.md | Documents implementation details and requirements coverage |
| .planning/STATE.md | Updates project state to reflect 02-05 completion |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| recipient_type = "default" | ||
|
|
||
| tone_guidance = _TONE_PROMPTS[recipient_type] | ||
| deps.lead.__dict__["_resolved_recipient_type"] = recipient_type # stored for persistence |
There was a problem hiding this comment.
The variable lead.__dict__["_resolved_recipient_type"] uses private dict manipulation to store state on the Lead SQLModel instance. This is fragile and bypasses the ORM layer. The underscore prefix suggests this is meant to be temporary/internal, but the value is later read from the same dict in run_writer() (line 328).
Consider either:
- Adding a proper field to the Lead model if this needs to be persisted
- Storing this in a local variable or the WriterDeps if it's only needed during the current operation
- Returning it as part of the EmailDraft if it needs to be passed between functions
| deps.lead.__dict__["_resolved_recipient_type"] = recipient_type # stored for persistence | |
| setattr(deps.lead, "_resolved_recipient_type", recipient_type) # stored for later use |
There was a problem hiding this comment.
Agreed. lead.__dict__[...] is fragile and bypasses the ORM. Fixing by adding _resolved_recipient_type: str = "default" as a field on WriterDeps directly — that's the right place for transient pipeline state.
| "PyMuPDF>=1.24", | ||
| "python-docx>=1.1", | ||
| "beautifulsoup4>=4.12", | ||
| "scikit-learn>=1.5", |
There was a problem hiding this comment.
The dependency scikit-learn>=1.5 is added but not used in any of the changed files in this PR. Consider either:
- Removing it if it's not needed yet
- Documenting in the PR description or comments why it's being added preemptively
- Or verifying that it's used in other parts of the codebase not visible in this diff
Adding unused dependencies increases installation size and potential security surface area.
There was a problem hiding this comment.
Push back — sklearn is used in src/ingot/scoring/scorer.py (TfidfVectorizer + cosine_similarity for the Phase 2 scoring formula). Not visible in this PR's diff since scorer.py was added in 02-02, but it's an active dependency.
| @field_validator("talking_points") | ||
| @classmethod | ||
| def at_least_one_talking_point(cls, v: list[str]) -> list[str]: | ||
| if not v: | ||
| raise ValueError("IntelBriefFull must have at least one talking point") | ||
| return v | ||
|
|
||
|
|
There was a problem hiding this comment.
The talking_points field has redundant validation - both a Field constraint min_length=1, max_length=3 (line 46) and a custom validator that checks for empty lists (lines 49-54). The Field constraints already enforce that there must be at least 1 item, making the custom validator's empty check redundant.
Consider removing the custom validator since the Field constraints are sufficient, or clarify in a comment why both are needed (e.g., if Field constraints produce different error messages than the custom validator).
| @field_validator("talking_points") | |
| @classmethod | |
| def at_least_one_talking_point(cls, v: list[str]) -> list[str]: | |
| if not v: | |
| raise ValueError("IntelBriefFull must have at least one talking point") | |
| return v |
There was a problem hiding this comment.
Valid. Field(min_length=1) already enforces non-empty. Removing the custom validator.
| async def run_mcq(deps: WriterDeps) -> MCQAnswers: | ||
| """ | ||
| Run the optional MCQ flow for personalization. | ||
|
|
||
| LOCKED DECISION (02-CONTEXT.md): | ||
| - MCQ is optional — confirm with user before running | ||
| - If skipped, returns MCQAnswers(answers={}, skipped=True) | ||
| - When run, questions are LLM-generated from IntelBriefFull (not hardcoded) | ||
| - Question types: personalization hook + tone/intent + optional experience connection | ||
|
|
||
| Returns MCQAnswers with answers dict (question -> answer) and skipped flag. | ||
| """ | ||
| run_mcq_flag = questionary.confirm( | ||
| f"Run personalization questions for {deps.intel_brief.company_name}? " | ||
| f"(recommended, or press Enter to skip)", | ||
| default=True, | ||
| ).ask() | ||
|
|
||
| if not run_mcq_flag: | ||
| return MCQAnswers(answers={}, skipped=True) | ||
|
|
||
| # LLM generates questions from IntelBriefFull | ||
| result = await mcq_agent.run( | ||
| "Generate personalized MCQ questions for this lead's outreach email.", | ||
| deps=deps, | ||
| ) | ||
| questions: list[str] = result.output.questions | ||
|
|
||
| # Collect answers interactively | ||
| answers: dict[str, str] = {} | ||
| for q in questions: | ||
| answer = questionary.text(q, default="").ask() | ||
| if answer and answer.strip(): | ||
| answers[q] = answer.strip() | ||
|
|
||
| return MCQAnswers(answers=answers, skipped=False) |
There was a problem hiding this comment.
The run_mcq() function uses questionary.confirm() and questionary.text() which block for user input. This function is declared as async (line 98) but contains blocking synchronous I/O operations, which defeats the purpose of async and could cause performance issues in an async application.
Consider either:
- Making the function synchronous (remove
async) since it performs blocking I/O - Or using async-compatible CLI input methods if available
- Or documenting that this function is intentionally blocking despite being async
The issue is that while the function itself is async (it awaits the LLM call), the questionary calls will block the entire event loop, preventing other async operations from running.
There was a problem hiding this comment.
The function must stay async — it awaits mcq_agent.run() which is a PydanticAI async call. There's no async-compatible questionary alternative with the same API. The blocking questionary calls are expected in a CLI tool context (not a web server). Adding a docstring note to make this explicit.
| "aioimaplib>=2", | ||
| "PyMuPDF>=1.24", | ||
| "python-docx>=1.1", | ||
| "beautifulsoup4>=4.12", |
There was a problem hiding this comment.
The dependency beautifulsoup4>=4.12 is added but not imported or used in any of the changed files in this PR. Consider either:
- Removing it if it's not needed yet
- Documenting why it's being added preemptively
- Or verifying that it's used in other parts of the codebase
BeautifulSoup is typically used for HTML parsing, which might be for web scraping in the Scout or Research agents.
| "beautifulsoup4>=4.12", |
There was a problem hiding this comment.
bs4 is a phase-locked dependency documented in 02-RESEARCH.md (SCOUT-03: 'YC scraping strategy: httpx + BS4 fallback'). It's listed in 02-01-PLAN.md as a phase-wide dependency, added at the phase boundary in 02-01 for use in the Scout agent (02-02). Dependencies are added by phase, not per-agent file.
| @field_validator("body") | ||
| @classmethod | ||
| def body_must_mention_company(cls, v: str, info: object) -> str: | ||
| # Validated post-generation: body must reference something specific. | ||
| # This is a soft check — passes unless body is suspiciously short. | ||
| if len(v) < 100: | ||
| raise ValueError("Email body is too short to be personalized (< 100 chars)") | ||
| return v |
There was a problem hiding this comment.
The field validator body_must_mention_company has a misleading name and inadequate implementation. It only checks if the body is longer than 100 characters, but doesn't actually verify that the company name is mentioned. Consider either:
- Renaming to
body_must_have_minimum_lengthto match what it actually does - Or implementing the actual check: verify that
company_namefrom the IntelBriefFull appears in the body text
Note: The validator receives an info parameter but doesn't use it to access other fields for cross-field validation.
There was a problem hiding this comment.
Valid. Renaming to body_must_have_minimum_length to match what it actually validates.
|
|
||
| @field_validator("body") | ||
| @classmethod | ||
| def body_must_mention_company(cls, v: str, info: object) -> str: |
There was a problem hiding this comment.
The info parameter in the validator is typed as object which provides no type safety or IDE support. In Pydantic v2, field validators receive a ValidationInfo parameter. Consider updating the type hint to ValidationInfo from pydantic or pydantic.types for better type safety and access to validation context.
There was a problem hiding this comment.
Valid. Updating to pydantic.ValidationInfo.
| await deps.session.refresh(email_record) | ||
|
|
||
| # Persist Day 3 follow-up (WRITER-09, DB-06) | ||
| followup_day3 = FollowUp( | ||
| parent_email_id=email_record.id, | ||
| scheduled_for_day=3, | ||
| body=draft.followup_day3, | ||
| status=FollowUpStatus.queued, | ||
| created_at=datetime.utcnow(), | ||
| ) | ||
| # Persist Day 7 follow-up (WRITER-09, DB-06) | ||
| followup_day7 = FollowUp( | ||
| parent_email_id=email_record.id, |
There was a problem hiding this comment.
After the first commit at line 342, email_record.id is used at line 347 to set parent_email_id for follow-ups. However, between line 342 and 343, there's an await session.refresh(email_record) which ensures the ID is populated.
The issue is that if the refresh fails or if the ID assignment fails in the database, lines 347 and 355 would use None for parent_email_id, creating follow-ups that aren't linked to any email. Consider adding an assertion or validation after line 343 to ensure email_record.id is not None before proceeding.
There was a problem hiding this comment.
Valid. Adding assert email_record.id is not None, "DB failed to assign email.id after flush" after the flush before using the ID.
| # on the installed version. Try both; fall back to None (single-column). | ||
| try: | ||
| from pymupdf import column_boxes | ||
| except ImportError: | ||
| try: | ||
| from pymupdf.utils import column_boxes # type: ignore[no-redef] | ||
| except ImportError: | ||
| column_boxes = None # type: ignore[assignment] |
There was a problem hiding this comment.
The fallback handling for column_boxes import uses type: ignore[no-redef] (line 57) and type: ignore[assignment] (line 59) to suppress type checker warnings. While the logic is sound (trying multiple import paths), the type annotations could be improved.
Consider defining column_boxes with a proper type hint at the module level (e.g., Optional[Callable]) and removing the type ignores, or use a more explicit protocol/type stub for the expected function signature. This would make the code more maintainable and provide better IDE support.
There was a problem hiding this comment.
The try/except structure is self-documenting and works correctly. Not adding a protocol stub for an internal fallback import path.
| match_result: MatchResult | ||
| lead: Lead | ||
| session: AsyncSession | ||
| mcq_answers: MCQAnswers = field(default_factory=lambda: MCQAnswers(answers={}, skipped=True)) |
There was a problem hiding this comment.
The default value for mcq_answers in WriterDeps uses a lambda (line 45), but this creates a new MCQAnswers instance at field definition time, not at instance creation time. In dataclasses with mutable defaults, this is typically a pattern to avoid shared mutable state, but field(default_factory=...) is the more idiomatic approach.
Consider changing to:
mcq_answers: MCQAnswers = field(default_factory=MCQAnswers)This assumes MCQAnswers(answers={}, skipped=True) is the default constructor behavior, which it is based on the schema definition (line 65-69).
| mcq_answers: MCQAnswers = field(default_factory=lambda: MCQAnswers(answers={}, skipped=True)) | |
| mcq_answers: MCQAnswers = field(default_factory=MCQAnswers) |
There was a problem hiding this comment.
Push back. The code already uses field(default_factory=lambda: MCQAnswers(answers={}, skipped=True)) — this IS the field(default_factory=...) pattern. The lambda is required because MCQAnswers has no defaults for answers and skipped. The suggested replacement field(default_factory=MCQAnswers) would raise TypeError: MCQAnswers.__init__() missing 2 required positional arguments. The reviewer's description of the current code as creating an instance 'at field definition time' is also incorrect — default_factory always creates a new instance per object construction.
yc.py: - Remove unused asyncio import - Replace assert isinstance() with explicit ValueError (includes URL in message) - Drop assert len > 100 — breaks batch/industry endpoints with < 100 results scorer.py: - Replace assert in ScoringWeights.__post_init__ with ValueError - Fix score_lead() mutable default: weights=None sentinel, resolved inside body - Add v2 TODO comment on _semantic_score for pre-fit vectorizer optimization scout.py: - Fix dedup bug: _is_duplicate queried Lead.person_email but received company_website; now correctly queries Lead.company_website - Fix ScoutDeps.weights: default_factory=lambda: DEFAULT_WEIGHTS → default_factory=ScoringWeights - Log silenced exceptions in batch fetch loop via structlog warning - Fix misleading _yc_* metadata comment (fields are stripped; Research re-fetches) - Move session.commit() outside per-row loop; single atomic commit with rollback schemas.py: - Rename body_must_mention_company → body_must_be_sufficiently_long (name now matches impl) profile.py: - Remove .doc from allowed suffixes (python-docx cannot parse binary .doc files) - Update error message to direct users to convert .doc → .docx writer.py (PR #6 carried fixes): - Replace lead.__dict__["_resolved_recipient_type"] with WriterDeps.resolved_recipient_type field - Replace datetime.utcnow() with datetime.now(timezone.utc) (Python 3.12+ deprecation) - Replace double session.commit() with flush() + single atomic commit Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- profile.py: remove stale inline `from docx import Document` inside extract_docx_text (already top-level) - profile.py: datetime.utcnow() → datetime.now(timezone.utc) in extract_profile - profile.py: add TODO(02-06) comment to model_override param docstring - tests: replace WriterAgent-based TestWriterPipeline with factory-API imports test (run_writer, WriterDeps, create_writer_agent) and inspect.iscoroutinefunction check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
mcq_agent(Haiku) that generates 2–3 context-specific questions fromIntelBriefFull— questions reference actual company/person context, not hardcoded templates; run is optional with aquestionary.confirm()skip gatewriter_agent(Sonnet,output_type=EmailDraft) with role-based tone routing: HR emails are longer with credentials foregrounded; CTO/CEO emails are short and direct; unknown defaults to short/directbuild_can_spam_footer()with all 3 FTC-mandated elements (sender identity, physical address, unsubscribe mechanism);run_writer()persistsEmail+ twoFollowUprecords (day=3, day=7) and advancesLead.status → draftedIncludes 02-01 commits (schemas + profile agent) as base.
Key Decisions
person_rolestring;_TONE_PROMPTSdict drives structurally different system prompt injections per recipient typeretrigger_mcq=Truepath (WRITER-13) — reject/regenerate with "different angle" re-runs MCQ before re-calling writer, same DB sessionIntelBrief + match dataalone if user skips; no forced interactionTest plan
company_nameorperson_rolefrom theIntelBriefFull, not generic placeholdersEmailDraft.can_spam_footercontains physical address and unsubscribe textFollowUprows (day=3, day=7) created in DB afterrun_writer()Lead.status == "drafted"after successful writer runquestionary.confirm()→ No) still produces a completeEmailDraft🤖 Generated with Claude Code