Three-tier populate agent: triage-extract + investigate subagents#91
Three-tier populate agent: triage-extract + investigate subagents#91MMeteorL wants to merge 6 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
First batch: 5 searches → top 5 URLs → 5 parallel extract_rows. Subsequent batches: up to 20 searches (from leads + new angles) → top 10 URLs → up to 10 parallel extract_rows → list_rows → repeat. Mirrors the original investigate_row dispatch discipline but scaled for the triage-extract architecture. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR refactors the dataset populate pipeline from a single-row investigate-based approach to a batch-oriented triage→extract orchestration. It introduces buildExtractTool (extract_rows, list_rows, investigate_entity, insert_row, update_row_by_key) with shared run state and confidence-based updates, threads a configurable targetRows cap through tooling, adds a prune step for incomplete rows before dataset-ready notifications, adapts the investigate agent to update existing rows by key, and enriches search results with optional site_name. Sequence DiagramsequenceDiagram
participant PopulateOrch as Populate Orchestrator
participant SearchWeb as search_web
participant ExtractRows as extract_rows
participant TriageAgent as TriageExtractAgent
participant InvestigateEnt as investigate_entity
participant ConvexDB as Convex
PopulateOrch->>SearchWeb: search for dataset topics
SearchWeb->>PopulateOrch: return results with site_name
PopulateOrch->>ExtractRows: batch promising URLs
ExtractRows->>ConvexDB: refresh rowIndex (capture parallel updates)
ExtractRows->>TriageAgent: triage URL and extract entities
TriageAgent->>ConvexDB: insert_row (new entity)
TriageAgent->>ConvexDB: update_row_by_key (higher confidence)
TriageAgent->>InvestigateEnt: investigate_entity (fill blanks)
InvestigateEnt->>ConvexDB: update_row_by_key (findings)
ExtractRows->>ConvexDB: list_rows (report progress)
ExtractRows->>PopulateOrch: return triage_status
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/mastra/agents/investigate.ts (1)
10-65:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftThis refactor breaks the repo’s investigate-subagent contract.
The instructions and tool wiring now enforce an
update_row_by_key-only agent, but this file is still documented to expose the four-tool investigate surface (insert_row,list_rows,search_web,fetch_page) frombuildInvestigateAgent(authorizedDatasetId, authContext, columns). If this contract is intentionally changing, the repository guideline and any dependent callers/tests need to move in the same PR; otherwise this agent should keep the required interface.
As per coding guidelines, "Use buildInvestigateAgent(authorizedDatasetId, authContext, columns) factory to build per-entity subagents with captured dataset id for capability scoping" and "Investigate subagent must have exactly 4 tools:insert_row,list_rows,search_web,fetch_pageand must return structured feedback (INSERTED/SUMMARY/CLUES/REASON)".Also applies to: 79-93
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/agents/investigate.ts` around lines 10 - 65, The refactor changed the investigate agent's behavior to only call update_row_by_key, breaking the required investigate-subagent contract; restore the original 4-tool interface and structured feedback by updating buildInvestigateInstructions and the agent factory to document and expose the four tools (insert_row, list_rows, search_web, fetch_page) and ensure the agent implementation created by buildInvestigateAgent(authorizedDatasetId, authContext, columns) uses those tools (not just update_row_by_key) and returns the exact final structured output (INSERTED/SUMMARY/CLUES/REASON); if the contract is intentionally changing, instead update all dependent callers/tests and repository guidelines consistently to reflect the new single-tool behavior but do not leave this file inconsistent.
🧹 Nitpick comments (1)
backend/package.json (1)
30-30: Confirmtypescript@5.8.3validity/security; flag non-security compiler regressions
typescript@5.8.3is a published npm version; with^5.8.3it will stay on the5.xline (won’t float to6.x).- No known security vulnerabilities/CVEs are reported for
typescript@5.8.3in common vulnerability feeds.- Reported issues for TS 5.8.3 include compiler/typecheck crashes/regressions (reliability, not security); if CI/build failures occur, consider upgrading within the
5.xline.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/package.json` at line 30, The package.json currently specifies "typescript": "^5.8.3"; confirm this version is acceptable by (1) checking official npm/CVE/advisory feeds for any reported security vulnerabilities for typescript@5.8.3 and documenting that none were found, (2) running the project CI/build and full test/typecheck to detect any compiler/regression issues caused by 5.8.3, and (3) if build/typecheck failures occur, update package.json to a newer safe 5.x release (and update lockfile) rather than jumping to 6.x; if everything passes, leave "^5.8.3" but note in the PR that regressions are a known non-security risk.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/src/env.ts`:
- Around line 28-31: POPULATE_TARGET_ROWS is being set via
Number(process.env.BIGSET_POPULATE_TARGET_ROWS||"20") without validation, which
can yield NaN or non-positive values; update the initialization so you parse and
validate BIGSET_POPULATE_TARGET_ROWS (use parseInt or Number), ensure
Number.isInteger(value) and value > 0, and if invalid either fallback to the
default 20 or throw a clear error/log (modify the POPULATE_TARGET_ROWS
assignment and add validation logic around
process.env.BIGSET_POPULATE_TARGET_ROWS to enforce a positive integer).
In `@backend/src/mastra/agents/populate.ts`:
- Around line 12-72: The current orchestrator text in
buildOrchestratorInstructions changes the agent surface by referencing
extract_rows and list_rows and delegating fetches/writes; revert it to the
repository contract by using buildPopulateAgent(authorizedDatasetId,
authContext, columns) to construct the orchestrator and ensure the instructions
only reference the three authorized tools: search_web, fetch_page, and
investigate_row (remove references to extract_rows, list_rows, and any
delegation that performs fetches or writes); update any logic or stop-condition
wording to rely on list_rows/fetch_page/investigate_row behavior prescribed by
buildPopulateAgent and preserve the search-only boundary.
In `@backend/src/mastra/tools/investigate-tool.ts`:
- Around line 155-183: The insert path can overshoot targetRows because the
extract_rows check runs elsewhere; inside the execute handler for insert_row
(the async execute with params { primary_key, confidence, sources, data }) add a
guard right before calling convex.mutation(internal.datasetRows.insert, ...)
that checks the current completed count (use rowIndex.size or the same
completed-row counter you maintain) against targetRows and aborts the insert if
the cap is reached (log via logCtx and return a failure/skip result). Ensure you
perform this check before the mutation and only call rowIndex.set(primary_key,
...) after a successful insert; reference: execute, primary_key, targetRows,
rowIndex, convex.mutation(internal.datasetRows.insert), authorizedDatasetId.
- Around line 431-452: The investigate subagent is being spawned with
buildInvestigateAgent and limited to maxSteps: 20; update the launch to use
buildInvestigateTool(authorizedDatasetId, authContext, columns) and call its
generate with maxSteps: 25 instead of 20, then pass the resulting text into
parseInvestigateOutput as before (ensure you're replacing buildInvestigateAgent
and the agent.generate(..., { maxSteps: 20 }) call with the
buildInvestigateTool(...) invocation and generate(..., { maxSteps: 25 }) so the
subagent conforms to the required 25-step budget).
In `@frontend/convex/datasetRows.ts`:
- Around line 199-202: The completeness check in isComplete currently tests
every name in args.columnNames against data, but the contract excludes fields
starting with '_' from being required; update the logic in the isComplete
computation to first filter args.columnNames to exclude any column name that
starts with '_' (e.g., col.startsWith('_')) and then run the every(...) check
against data so underscore-prefixed fields are never considered required when
determining completeness.
---
Outside diff comments:
In `@backend/src/mastra/agents/investigate.ts`:
- Around line 10-65: The refactor changed the investigate agent's behavior to
only call update_row_by_key, breaking the required investigate-subagent
contract; restore the original 4-tool interface and structured feedback by
updating buildInvestigateInstructions and the agent factory to document and
expose the four tools (insert_row, list_rows, search_web, fetch_page) and ensure
the agent implementation created by buildInvestigateAgent(authorizedDatasetId,
authContext, columns) uses those tools (not just update_row_by_key) and returns
the exact final structured output (INSERTED/SUMMARY/CLUES/REASON); if the
contract is intentionally changing, instead update all dependent callers/tests
and repository guidelines consistently to reflect the new single-tool behavior
but do not leave this file inconsistent.
---
Nitpick comments:
In `@backend/package.json`:
- Line 30: The package.json currently specifies "typescript": "^5.8.3"; confirm
this version is acceptable by (1) checking official npm/CVE/advisory feeds for
any reported security vulnerabilities for typescript@5.8.3 and documenting that
none were found, (2) running the project CI/build and full test/typecheck to
detect any compiler/regression issues caused by 5.8.3, and (3) if
build/typecheck failures occur, update package.json to a newer safe 5.x release
(and update lockfile) rather than jumping to 6.x; if everything passes, leave
"^5.8.3" but note in the PR that regressions are a known non-security risk.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 32a5a17c-21f2-4a1e-a10e-f543d896dc3b
⛔ Files ignored due to path filters (1)
backend/package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (10)
backend/package.jsonbackend/src/env.tsbackend/src/index.tsbackend/src/mastra/agents/investigate.tsbackend/src/mastra/agents/populate.tsbackend/src/mastra/agents/triage-extract.tsbackend/src/mastra/tools/investigate-tool.tsbackend/src/mastra/tools/web-tools.tsbackend/src/mastra/workflows/populate.tsfrontend/convex/datasetRows.ts
Remove the hard top-5 / top-10 count limits per batch. Instead, dispatch every URL that clears a quality bar (relevance, data value, source authority, novelty against already-complete rows). Consolidate the redundant second/subsequent batch sections into a single loop. Steer searches using list_rows output to avoid re-searching complete entities. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/mastra/agents/populate.ts (1)
91-98:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHarden
targetRowsparsing to prevent disabled caps on invalid env values.
Number(process.env.BIGSET_POPULATE_TARGET_ROWS || "20")can yieldNaN/0/negative values, which can break cap/stop behavior and allow runaway extraction.💡 Minimal fix
+const DEFAULT_TARGET_ROWS = (() => { + const parsed = Number(process.env.BIGSET_POPULATE_TARGET_ROWS); + return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : 20; +})(); + export function buildPopulateAgent( authorizedDatasetId: string, authContext: AuthContext, columns: PopulateColumn[], - targetRows: number = Number(process.env.BIGSET_POPULATE_TARGET_ROWS || "20"), + targetRows: number = DEFAULT_TARGET_ROWS, ): Agent {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/agents/populate.ts` around lines 91 - 98, The targetRows env parsing is unsafe and can produce NaN/0/negative values which disable extraction caps; update the parsing for targetRows (the variable passed into buildExtractTool) to robustly parse and validate the environment value (use parseInt/Number, ensure it's a finite positive integer, floor it) and fall back to the default 20 when the parsed value is invalid or <= 0; optionally clamp to a sensible maximum before passing into buildExtractTool to avoid runaway extraction.
♻️ Duplicate comments (1)
backend/src/mastra/agents/populate.ts (1)
31-56:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftRestore the orchestrator’s required tool surface (
search_web,fetch_page,investigate_row).This change moves orchestration to
extract_rows/list_rows, which violates the required populate-agent contract and removes the mandated no-write boundary at the orchestrator layer.As per coding guidelines, "Populate orchestrator agent must have exactly 3 tools:
search_web,fetch_page,investigate_rowwith no write access".Also applies to: 105-109
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/agents/populate.ts` around lines 31 - 56, The orchestrator removed the mandated tool surface and must be restored to expose exactly the three read-only tools: search_web, fetch_page, and investigate_row and must not perform any write operations itself; revert or refactor the current orchestration flow that moved responsibilities into extract_rows/list_rows so that populate (or the populate-agent orchestrator function) only calls search_web to get URL candidates, calls fetch_page to retrieve page content, and calls investigate_row to analyze content and emit leads, and ensure extract_rows/list_rows remain downstream workers (or are called by investigate_row) rather than replacing the orchestrator tools; locate usages of extract_rows, list_rows, and the populate-orchestrator function and change them so the orchestrator exposes and calls the three tool functions (search_web, fetch_page, investigate_row) and contains no writes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@backend/src/mastra/agents/populate.ts`:
- Around line 91-98: The targetRows env parsing is unsafe and can produce
NaN/0/negative values which disable extraction caps; update the parsing for
targetRows (the variable passed into buildExtractTool) to robustly parse and
validate the environment value (use parseInt/Number, ensure it's a finite
positive integer, floor it) and fall back to the default 20 when the parsed
value is invalid or <= 0; optionally clamp to a sensible maximum before passing
into buildExtractTool to avoid runaway extraction.
---
Duplicate comments:
In `@backend/src/mastra/agents/populate.ts`:
- Around line 31-56: The orchestrator removed the mandated tool surface and must
be restored to expose exactly the three read-only tools: search_web, fetch_page,
and investigate_row and must not perform any write operations itself; revert or
refactor the current orchestration flow that moved responsibilities into
extract_rows/list_rows so that populate (or the populate-agent orchestrator
function) only calls search_web to get URL candidates, calls fetch_page to
retrieve page content, and calls investigate_row to analyze content and emit
leads, and ensure extract_rows/list_rows remain downstream workers (or are
called by investigate_row) rather than replacing the orchestrator tools; locate
usages of extract_rows, list_rows, and the populate-orchestrator function and
change them so the orchestrator exposes and calls the three tool functions
(search_web, fetch_page, investigate_row) and contains no writes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b0a10655-2fca-4ffc-82f7-b3c3f0950c14
📒 Files selected for processing (1)
backend/src/mastra/agents/populate.ts
- env.ts: validate BIGSET_POPULATE_TARGET_ROWS — NaN, zero, and negative values now fall back to the default of 20 instead of silently breaking the agent's stop condition. - datasetRows.mergeUpdate (new Convex internalMutation): atomic per-field blank-aware merge. Blank cells are always filled with any non-empty incoming value regardless of confidence; non-blank cells are only overwritten when newConfidence > existing row confidence. The authoritative check lives in Convex so two concurrent investigate agents can't both pass a stale client-side confidence check and race to write — the compare-and-merge is serialized inside a single transaction. Quota is charged only on actual changes; history entries are recorded per changed field. - investigate-tool.ts / buildUpdateRowByKeyTool: drop the row-wide confidence pre-check (which blocked filling blank columns on partial high-confidence rows) and call mergeUpdate instead of update. The local rowIndex is updated optimistically using the same per-field rules so subsequent calls within the run stay consistent without an extra Convex round-trip. - datasetRows.deleteIncomplete: skip _-prefixed keys (_confidence, _sources, etc.) in the completeness check — they are internal fields that are never required columns. - backend/CLAUDE.md: rewrite Mastra section to reflect the tri-agent architecture and new confidence/merge semantics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
One textual conflict in investigate-tool.ts: commit 1e9cc8b on main ("Unify local env loading") fixed a redundant `?? false` in the old parseInvestigateResult return block. Our branch had already replaced that entire function with parseTriageExtractOutput and parseInvestigateOutput. Git's heuristic matched the `return { ... }` structure across the two different functions and flagged a false conflict. Resolution: keep HEAD (our new parsers are correct; the old function no longer exists). All other files merged cleanly: - backend/src/env.ts: main's dotenv path consolidation + our POPULATE_TARGET_ROWS validation both land without conflict. - frontend/convex/datasetRows.ts: main's rowCount denormalized counter (insert/remove/clearByDataset) merges alongside our mergeUpdate mutation and deleteIncomplete fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #85 ("Make populate status truthful") refactored the /populate route from synchronous (awaiting the workflow inline) to async: it now claims the dataset, creates a run, fires runPopulateWorkflowInBackground() as a background void promise, and returns 202 immediately. Our branch still had the old synchronous pattern with the post-workflow notification and status-update logic inline in the route handler — including our deleteIncomplete pruning addition. The conflict was the entire old inline block vs main's single 202 return. Resolution: - Take main's 202 return — the async pattern is strictly better. - Move the deleteIncomplete pruning into runPopulateWorkflowInBackground() (already introduced by main), placed before the countByDataset call so we count only complete rows when deciding whether to set status "live". All other files (populate.ts workflow, frontend status/schema changes) merged cleanly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/src/env.ts (1)
37-40:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix
POPULATE_TARGET_ROWSvalidation to reject non-integers before flooring.Current logic accepts fractional values and can produce
0(e.g.,0.5 -> 0), which can prematurely terminate populate runs. It also contradicts the comment that non-integers fallback to20.💡 Suggested patch
POPULATE_TARGET_ROWS: (() => { const parsed = Number(process.env.BIGSET_POPULATE_TARGET_ROWS); - return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : 20; + return Number.isInteger(parsed) && parsed > 0 ? parsed : 20; })(),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/env.ts` around lines 37 - 40, POPULATE_TARGET_ROWS currently accepts fractional numbers because it checks Number.isFinite(parsed) then floors, which lets values like 0.5 become 0; change the validation so only integer positive values are accepted (use Number.isInteger(parsed) && parsed > 0) and return parsed as-is (or Math.floor only if you want to coerce negatives handled) otherwise fall back to 20; update the initialization for POPULATE_TARGET_ROWS in env.ts to perform Number.isInteger(parsed) && parsed > 0 before using the value.frontend/convex/datasetRows.ts (1)
334-360:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
deleteIncompletedoes not update the cachedrowCount, causing stale dashboard counts.
insert,remove, andclearByDatasetall maintain the denormalizedrowCountfield on the dataset document. However,deleteIncompletecallsctx.db.delete(row._id)directly without decrementing the counter. After pruning, the dashboard will display a higher row count than actually exists until the next mutation that self-heals.🐛 Proposed fix: update rowCount after deleting rows
export const deleteIncomplete = internalMutation({ args: { datasetId: v.id("datasets"), columnNames: v.array(v.string()), }, handler: async (ctx, args) => { const rows = await ctx.db .query("datasetRows") .withIndex("by_dataset", (q) => q.eq("datasetId", args.datasetId)) .collect(); let deletedCount = 0; for (const row of rows) { const data = row.data as Record<string, unknown>; const isComplete = args.columnNames.every((col) => { if (col.startsWith("_")) return true; // skip internal fields (_confidence, _sources, etc.) const val = data[col]; return val !== null && val !== undefined && val !== ""; }); if (!isComplete) { await ctx.db.delete(row._id); deletedCount++; } } + + // Update the cached rowCount to stay in sync with actual rows. + if (deletedCount > 0) { + const dataset = await ctx.db.get(args.datasetId); + if (dataset) { + const newCount = + typeof dataset.rowCount === "number" + ? Math.max(0, dataset.rowCount - deletedCount) + : rows.length - deletedCount; // self-heal if rowCount was missing + await ctx.db.patch(args.datasetId, { rowCount: newCount }); + } + } + return { deletedCount }; }, });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/convex/datasetRows.ts` around lines 334 - 360, The deleteIncomplete mutation removes rows but never updates the dataset's denormalized rowCount, leaving stale counts; modify deleteIncomplete to mirror the behavior of insert/remove/clearByDataset by decrementing the dataset document's rowCount after deletions (use the same DB update/patch API used by insert/remove to subtract deletedCount for args.datasetId), or update the rowCount incrementally as you delete each row so the dataset's rowCount stays consistent with the actual rows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@backend/src/env.ts`:
- Around line 37-40: POPULATE_TARGET_ROWS currently accepts fractional numbers
because it checks Number.isFinite(parsed) then floors, which lets values like
0.5 become 0; change the validation so only integer positive values are accepted
(use Number.isInteger(parsed) && parsed > 0) and return parsed as-is (or
Math.floor only if you want to coerce negatives handled) otherwise fall back to
20; update the initialization for POPULATE_TARGET_ROWS in env.ts to perform
Number.isInteger(parsed) && parsed > 0 before using the value.
In `@frontend/convex/datasetRows.ts`:
- Around line 334-360: The deleteIncomplete mutation removes rows but never
updates the dataset's denormalized rowCount, leaving stale counts; modify
deleteIncomplete to mirror the behavior of insert/remove/clearByDataset by
decrementing the dataset document's rowCount after deletions (use the same DB
update/patch API used by insert/remove to subtract deletedCount for
args.datasetId), or update the rowCount incrementally as you delete each row so
the dataset's rowCount stays consistent with the actual rows.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fa06ba5d-6848-4387-adb5-d3e8897a9229
📒 Files selected for processing (5)
backend/package.jsonbackend/src/env.tsbackend/src/index.tsbackend/src/mastra/workflows/populate.tsfrontend/convex/datasetRows.ts
✅ Files skipped from review due to trivial changes (1)
- backend/package.json
Summary
Restructures the AI data-collection pipeline from a two-tier system (orchestrator → investigate) into a three-tier system (orchestrator → triage-extract → investigate), improving data quality, reducing per-agent context load, and creating a structured extension point for future browser/TinyFish agent routing.
Architecture before → after
Before
After
Feature changes
1. Triage-Extract Agent —
agents/triage-extract.ts(new)Replaces the previous multi-URL extract agent. Each call receives exactly one URL. Its first job is triage: after fetching the page it classifies it into one of five statuses before deciding what to do next.
extract_nowneeds_browser_agentneeds_form_filllow_valueblockedWhy the triage step matters beyond skipping bad pages: breaking each URL into its own short-lived agent keeps context windows small (one fetch ≈ 15 K chars instead of accumulating across multiple fetches), which was the primary cause of the previous agent hitting its step cap before writing any rows. More importantly, the five statuses create a structured dispatch table for future agent routing — pages classified as
needs_browser_agentorneeds_form_fillcan be handed off to a TinyFish agent or Playwright browser agent in a future iteration without changing either the orchestrator or the investigate layer. The triage result is returned to the orchestrator as a structured field, giving that routing path a clean hook to attach to.After classifying a page as
extract_now, the agent reads the full page first, identifies all matching entities, then batch-inserts all rows, then callsinvestigate_entityfor each row with missing columns — avoiding interleaved read/write step churn.2. Investigate Agent —
agents/investigate.ts(restored + adapted)Restored to the autonomous single-entity researcher pattern from the main branch. Key adaptation: the row already exists (inserted by the triage-extract agent), so the investigate agent works exclusively with
update_row_by_key, targeting specific missing columns. It runs parallel web searches, fetches the most promising pages, and fills the gaps — the same deep-research quality as the original, now triggered precisely when and where data is missing.3. Populate Orchestrator —
agents/populate.ts(updated)Search-dispatch cadence:
extract_rowscalls →list_rowslist_rows→ repeatQuality threshold replaces fixed count caps: instead of "select top 5" or "select top 10," the orchestrator dispatches every URL that clears a quality bar — relevance, data value, source authority, and novelty against already-complete rows. This prevents valuable URLs from being permanently discarded just because they didn't rank in an arbitrary top-N. The stagnant-batch stop condition (2 consecutive batches with no new complete rows) bounds total work without needing an artificial per-batch limit.
Search steering via
list_rows: after each batch the orchestrator reads the current dataset state and steers subsequent searches toward entity types that are still missing or incomplete, avoiding wasted work on already-complete rows.Other orchestrator improvements:
${currentMonth} ${currentYear}) so queries for "recent" or "current" data use the actual date, not training-data defaultsleadsfield fromextract_rowsreplaces structured JSON arrays, reducing schema brittleness4. Shared rowIndex +
list_rows—tools/investigate-tool.tsbuildExtractToolnow returns{ extractRowsTool, listRowsTool }— both backed by the same in-memoryrowIndexclosure.list_rowstherefore always reflects real-time state (including rows written by concurrentinvestigate_entitysub-calls) without a Convex round-trip. TheBIGSET_POPULATE_TARGET_ROWScap is enforced both in the orchestrator stop condition and as a hard early-exit inextract_rows.5. Post-workflow row pruning —
index.ts,datasetRows.tsAfter the populate workflow finishes, incomplete rows (any column blank) are deleted before the dataset is marked live. The dataset card and table therefore only ever show fully-populated rows. The prune is best-effort — a failure logs a warning but does not block the dataset-ready notification.
6.
POPULATE_TARGET_ROWSenv var —env.tsHard cap on complete rows per populate run (default: 20). Override with
BIGSET_POPULATE_TARGET_ROWS=Nin the root.env.Test plan
make convex-pushto deploydeleteIncompleteandlistInternalto the self-hosted Convex instance:4111) shows all three agent tiers in the workflow trace🤖 Generated with Claude Code