Skip to content

Three-tier populate agent: triage-extract + investigate subagents#91

Open
MMeteorL wants to merge 6 commits into
mainfrom
mastra_agent_system_update
Open

Three-tier populate agent: triage-extract + investigate subagents#91
MMeteorL wants to merge 6 commits into
mainfrom
mastra_agent_system_update

Conversation

@MMeteorL
Copy link
Copy Markdown
Collaborator

@MMeteorL MMeteorL commented May 25, 2026

Summary

Restructures the AI data-collection pipeline from a two-tier system (orchestrator → investigate) into a three-tier system (orchestrator → triage-extract → investigate), improving data quality, reducing per-agent context load, and creating a structured extension point for future browser/TinyFish agent routing.

Architecture before → after

Before

Populate Orchestrator  (search_web · fetch_page · investigate_row)
    └─▶ Investigate Agent × N   (search + fetch + insert — one entity per call)

After

Populate Orchestrator  (search_web · extract_rows · list_rows)
    └─▶ Triage-Extract Agent × N   (fetch_page · insert_row · update_row_by_key · investigate_entity)
             └─▶ Investigate Agent × M   (search_web · fetch_page · update_row_by_key)

Feature changes

1. Triage-Extract Agent — agents/triage-extract.ts (new)

Replaces the previous multi-URL extract agent. Each call receives exactly one URL. Its first job is triage: after fetching the page it classifies it into one of five statuses before deciding what to do next.

Triage status Meaning
extract_now Readable content with matching entities — proceed to extraction
needs_browser_agent JS-rendered SPA, login wall, or requires real browser interaction
needs_form_fill Search form or user input required before data appears
low_value Accessible but no entities relevant to the dataset
blocked 403 / 404 / paywall / CAPTCHA

Why the triage step matters beyond skipping bad pages: breaking each URL into its own short-lived agent keeps context windows small (one fetch ≈ 15 K chars instead of accumulating across multiple fetches), which was the primary cause of the previous agent hitting its step cap before writing any rows. More importantly, the five statuses create a structured dispatch table for future agent routing — pages classified as needs_browser_agent or needs_form_fill can be handed off to a TinyFish agent or Playwright browser agent in a future iteration without changing either the orchestrator or the investigate layer. The triage result is returned to the orchestrator as a structured field, giving that routing path a clean hook to attach to.

After classifying a page as extract_now, the agent reads the full page first, identifies all matching entities, then batch-inserts all rows, then calls investigate_entity for each row with missing columns — avoiding interleaved read/write step churn.

2. Investigate Agent — agents/investigate.ts (restored + adapted)

Restored to the autonomous single-entity researcher pattern from the main branch. Key adaptation: the row already exists (inserted by the triage-extract agent), so the investigate agent works exclusively with update_row_by_key, targeting specific missing columns. It runs parallel web searches, fetches the most promising pages, and fills the gaps — the same deep-research quality as the original, now triggered precisely when and where data is missing.

3. Populate Orchestrator — agents/populate.ts (updated)

Search-dispatch cadence:

  • First batch: 5 parallel searches → dispatch all qualifying URLs as parallel extract_rows calls → list_rows
  • All subsequent batches: up to 20 parallel searches (combining leads from extract agents with new angles) → dispatch all qualifying URLs → list_rows → repeat

Quality threshold replaces fixed count caps: instead of "select top 5" or "select top 10," the orchestrator dispatches every URL that clears a quality bar — relevance, data value, source authority, and novelty against already-complete rows. This prevents valuable URLs from being permanently discarded just because they didn't rank in an arbitrary top-N. The stagnant-batch stop condition (2 consecutive batches with no new complete rows) bounds total work without needing an artificial per-batch limit.

Search steering via list_rows: after each batch the orchestrator reads the current dataset state and steers subsequent searches toward entity types that are still missing or incomplete, avoiding wasted work on already-complete rows.

Other orchestrator improvements:

  • Live date injection (${currentMonth} ${currentYear}) so queries for "recent" or "current" data use the actual date, not training-data defaults
  • Natural language leads field from extract_rows replaces structured JSON arrays, reducing schema brittleness
  • URL deduplication: the orchestrator tracks every dispatched URL and never re-sends

4. Shared rowIndex + list_rowstools/investigate-tool.ts

buildExtractTool now returns { extractRowsTool, listRowsTool } — both backed by the same in-memory rowIndex closure. list_rows therefore always reflects real-time state (including rows written by concurrent investigate_entity sub-calls) without a Convex round-trip. The BIGSET_POPULATE_TARGET_ROWS cap is enforced both in the orchestrator stop condition and as a hard early-exit in extract_rows.

5. Post-workflow row pruning — index.ts, datasetRows.ts

After the populate workflow finishes, incomplete rows (any column blank) are deleted before the dataset is marked live. The dataset card and table therefore only ever show fully-populated rows. The prune is best-effort — a failure logs a warning but does not block the dataset-ready notification.

6. POPULATE_TARGET_ROWS env var — env.ts

Hard cap on complete rows per populate run (default: 20). Override with BIGSET_POPULATE_TARGET_ROWS=N in the root .env.


Test plan

  • Run make convex-push to deploy deleteIncomplete and listInternal to the self-hosted Convex instance
  • Trigger a populate run via the UI and confirm rows appear in real-time via Convex reactive queries
  • Confirm that incomplete rows are absent from the final dataset table (pruned post-workflow)
  • Verify the Mastra Studio (:4111) shows all three agent tiers in the workflow trace
  • Test with a time-sensitive prompt (e.g. "YC W26 companies") and confirm search queries include the current year/month

🤖 Generated with Claude Code

MMeteorL and others added 2 commits May 25, 2026 10:26
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
First batch: 5 searches → top 5 URLs → 5 parallel extract_rows.
Subsequent batches: up to 20 searches (from leads + new angles) →
top 10 URLs → up to 10 parallel extract_rows → list_rows → repeat.
Mirrors the original investigate_row dispatch discipline but scaled
for the triage-extract architecture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR refactors the dataset populate pipeline from a single-row investigate-based approach to a batch-oriented triage→extract orchestration. It introduces buildExtractTool (extract_rows, list_rows, investigate_entity, insert_row, update_row_by_key) with shared run state and confidence-based updates, threads a configurable targetRows cap through tooling, adds a prune step for incomplete rows before dataset-ready notifications, adapts the investigate agent to update existing rows by key, and enriches search results with optional site_name.

Sequence Diagram

sequenceDiagram
  participant PopulateOrch as Populate Orchestrator
  participant SearchWeb as search_web
  participant ExtractRows as extract_rows
  participant TriageAgent as TriageExtractAgent
  participant InvestigateEnt as investigate_entity
  participant ConvexDB as Convex
  
  PopulateOrch->>SearchWeb: search for dataset topics
  SearchWeb->>PopulateOrch: return results with site_name
  PopulateOrch->>ExtractRows: batch promising URLs
  ExtractRows->>ConvexDB: refresh rowIndex (capture parallel updates)
  ExtractRows->>TriageAgent: triage URL and extract entities
  TriageAgent->>ConvexDB: insert_row (new entity)
  TriageAgent->>ConvexDB: update_row_by_key (higher confidence)
  TriageAgent->>InvestigateEnt: investigate_entity (fill blanks)
  InvestigateEnt->>ConvexDB: update_row_by_key (findings)
  ExtractRows->>ConvexDB: list_rows (report progress)
  ExtractRows->>PopulateOrch: return triage_status
Loading

Possibly related PRs

  • tinyfish-io/bigset#81: Refactors populate-agent/tool wiring around investigate/extract and update_row_by_key changes affecting the same components.
  • tinyfish-io/bigset#26: Earlier implementation of the AI populate pipeline; this PR refactors the same /populate flow and tooling.
  • tinyfish-io/bigset#85: Related changes to the populate background/notification flow; both PRs modify populate completion handling and dataset-ready behavior.

Suggested reviewers

  • simantak-dabhade
  • manav-tf
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main architectural change: restructuring from a two-tier to a three-tier populate agent system with triage-extract and investigate subagents.
Description check ✅ Passed The description is thorough and clearly related to the changeset, detailing the new three-tier architecture, each agent's responsibilities, configuration changes, and test plan.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mastra_agent_system_update

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/src/mastra/agents/investigate.ts (1)

10-65: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

This refactor breaks the repo’s investigate-subagent contract.

The instructions and tool wiring now enforce an update_row_by_key-only agent, but this file is still documented to expose the four-tool investigate surface (insert_row, list_rows, search_web, fetch_page) from buildInvestigateAgent(authorizedDatasetId, authContext, columns). If this contract is intentionally changing, the repository guideline and any dependent callers/tests need to move in the same PR; otherwise this agent should keep the required interface.
As per coding guidelines, "Use buildInvestigateAgent(authorizedDatasetId, authContext, columns) factory to build per-entity subagents with captured dataset id for capability scoping" and "Investigate subagent must have exactly 4 tools: insert_row, list_rows, search_web, fetch_page and must return structured feedback (INSERTED/SUMMARY/CLUES/REASON)".

Also applies to: 79-93

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/agents/investigate.ts` around lines 10 - 65, The refactor
changed the investigate agent's behavior to only call update_row_by_key,
breaking the required investigate-subagent contract; restore the original 4-tool
interface and structured feedback by updating buildInvestigateInstructions and
the agent factory to document and expose the four tools (insert_row, list_rows,
search_web, fetch_page) and ensure the agent implementation created by
buildInvestigateAgent(authorizedDatasetId, authContext, columns) uses those
tools (not just update_row_by_key) and returns the exact final structured output
(INSERTED/SUMMARY/CLUES/REASON); if the contract is intentionally changing,
instead update all dependent callers/tests and repository guidelines
consistently to reflect the new single-tool behavior but do not leave this file
inconsistent.
🧹 Nitpick comments (1)
backend/package.json (1)

30-30: Confirm typescript@5.8.3 validity/security; flag non-security compiler regressions

  • typescript@5.8.3 is a published npm version; with ^5.8.3 it will stay on the 5.x line (won’t float to 6.x).
  • No known security vulnerabilities/CVEs are reported for typescript@5.8.3 in common vulnerability feeds.
  • Reported issues for TS 5.8.3 include compiler/typecheck crashes/regressions (reliability, not security); if CI/build failures occur, consider upgrading within the 5.x line.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/package.json` at line 30, The package.json currently specifies
"typescript": "^5.8.3"; confirm this version is acceptable by (1) checking
official npm/CVE/advisory feeds for any reported security vulnerabilities for
typescript@5.8.3 and documenting that none were found, (2) running the project
CI/build and full test/typecheck to detect any compiler/regression issues caused
by 5.8.3, and (3) if build/typecheck failures occur, update package.json to a
newer safe 5.x release (and update lockfile) rather than jumping to 6.x; if
everything passes, leave "^5.8.3" but note in the PR that regressions are a
known non-security risk.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/env.ts`:
- Around line 28-31: POPULATE_TARGET_ROWS is being set via
Number(process.env.BIGSET_POPULATE_TARGET_ROWS||"20") without validation, which
can yield NaN or non-positive values; update the initialization so you parse and
validate BIGSET_POPULATE_TARGET_ROWS (use parseInt or Number), ensure
Number.isInteger(value) and value > 0, and if invalid either fallback to the
default 20 or throw a clear error/log (modify the POPULATE_TARGET_ROWS
assignment and add validation logic around
process.env.BIGSET_POPULATE_TARGET_ROWS to enforce a positive integer).

In `@backend/src/mastra/agents/populate.ts`:
- Around line 12-72: The current orchestrator text in
buildOrchestratorInstructions changes the agent surface by referencing
extract_rows and list_rows and delegating fetches/writes; revert it to the
repository contract by using buildPopulateAgent(authorizedDatasetId,
authContext, columns) to construct the orchestrator and ensure the instructions
only reference the three authorized tools: search_web, fetch_page, and
investigate_row (remove references to extract_rows, list_rows, and any
delegation that performs fetches or writes); update any logic or stop-condition
wording to rely on list_rows/fetch_page/investigate_row behavior prescribed by
buildPopulateAgent and preserve the search-only boundary.

In `@backend/src/mastra/tools/investigate-tool.ts`:
- Around line 155-183: The insert path can overshoot targetRows because the
extract_rows check runs elsewhere; inside the execute handler for insert_row
(the async execute with params { primary_key, confidence, sources, data }) add a
guard right before calling convex.mutation(internal.datasetRows.insert, ...)
that checks the current completed count (use rowIndex.size or the same
completed-row counter you maintain) against targetRows and aborts the insert if
the cap is reached (log via logCtx and return a failure/skip result). Ensure you
perform this check before the mutation and only call rowIndex.set(primary_key,
...) after a successful insert; reference: execute, primary_key, targetRows,
rowIndex, convex.mutation(internal.datasetRows.insert), authorizedDatasetId.
- Around line 431-452: The investigate subagent is being spawned with
buildInvestigateAgent and limited to maxSteps: 20; update the launch to use
buildInvestigateTool(authorizedDatasetId, authContext, columns) and call its
generate with maxSteps: 25 instead of 20, then pass the resulting text into
parseInvestigateOutput as before (ensure you're replacing buildInvestigateAgent
and the agent.generate(..., { maxSteps: 20 }) call with the
buildInvestigateTool(...) invocation and generate(..., { maxSteps: 25 }) so the
subagent conforms to the required 25-step budget).

In `@frontend/convex/datasetRows.ts`:
- Around line 199-202: The completeness check in isComplete currently tests
every name in args.columnNames against data, but the contract excludes fields
starting with '_' from being required; update the logic in the isComplete
computation to first filter args.columnNames to exclude any column name that
starts with '_' (e.g., col.startsWith('_')) and then run the every(...) check
against data so underscore-prefixed fields are never considered required when
determining completeness.

---

Outside diff comments:
In `@backend/src/mastra/agents/investigate.ts`:
- Around line 10-65: The refactor changed the investigate agent's behavior to
only call update_row_by_key, breaking the required investigate-subagent
contract; restore the original 4-tool interface and structured feedback by
updating buildInvestigateInstructions and the agent factory to document and
expose the four tools (insert_row, list_rows, search_web, fetch_page) and ensure
the agent implementation created by buildInvestigateAgent(authorizedDatasetId,
authContext, columns) uses those tools (not just update_row_by_key) and returns
the exact final structured output (INSERTED/SUMMARY/CLUES/REASON); if the
contract is intentionally changing, instead update all dependent callers/tests
and repository guidelines consistently to reflect the new single-tool behavior
but do not leave this file inconsistent.

---

Nitpick comments:
In `@backend/package.json`:
- Line 30: The package.json currently specifies "typescript": "^5.8.3"; confirm
this version is acceptable by (1) checking official npm/CVE/advisory feeds for
any reported security vulnerabilities for typescript@5.8.3 and documenting that
none were found, (2) running the project CI/build and full test/typecheck to
detect any compiler/regression issues caused by 5.8.3, and (3) if
build/typecheck failures occur, update package.json to a newer safe 5.x release
(and update lockfile) rather than jumping to 6.x; if everything passes, leave
"^5.8.3" but note in the PR that regressions are a known non-security risk.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 32a5a17c-21f2-4a1e-a10e-f543d896dc3b

📥 Commits

Reviewing files that changed from the base of the PR and between 1e9cc8b and 4fac5bc.

⛔ Files ignored due to path filters (1)
  • backend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (10)
  • backend/package.json
  • backend/src/env.ts
  • backend/src/index.ts
  • backend/src/mastra/agents/investigate.ts
  • backend/src/mastra/agents/populate.ts
  • backend/src/mastra/agents/triage-extract.ts
  • backend/src/mastra/tools/investigate-tool.ts
  • backend/src/mastra/tools/web-tools.ts
  • backend/src/mastra/workflows/populate.ts
  • frontend/convex/datasetRows.ts

Comment thread backend/src/env.ts Outdated
Comment thread backend/src/mastra/agents/populate.ts
Comment thread backend/src/mastra/tools/investigate-tool.ts
Comment thread backend/src/mastra/tools/investigate-tool.ts
Comment thread backend/src/mastra/tools/investigate-tool.ts
Comment thread frontend/convex/datasetRows.ts
Remove the hard top-5 / top-10 count limits per batch. Instead, dispatch
every URL that clears a quality bar (relevance, data value, source
authority, novelty against already-complete rows). Consolidate the
redundant second/subsequent batch sections into a single loop. Steer
searches using list_rows output to avoid re-searching complete entities.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/src/mastra/agents/populate.ts (1)

91-98: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden targetRows parsing to prevent disabled caps on invalid env values.

Number(process.env.BIGSET_POPULATE_TARGET_ROWS || "20") can yield NaN/0/negative values, which can break cap/stop behavior and allow runaway extraction.

💡 Minimal fix
+const DEFAULT_TARGET_ROWS = (() => {
+  const parsed = Number(process.env.BIGSET_POPULATE_TARGET_ROWS);
+  return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : 20;
+})();
+
 export function buildPopulateAgent(
   authorizedDatasetId: string,
   authContext: AuthContext,
   columns: PopulateColumn[],
-  targetRows: number = Number(process.env.BIGSET_POPULATE_TARGET_ROWS || "20"),
+  targetRows: number = DEFAULT_TARGET_ROWS,
 ): Agent {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/agents/populate.ts` around lines 91 - 98, The targetRows
env parsing is unsafe and can produce NaN/0/negative values which disable
extraction caps; update the parsing for targetRows (the variable passed into
buildExtractTool) to robustly parse and validate the environment value (use
parseInt/Number, ensure it's a finite positive integer, floor it) and fall back
to the default 20 when the parsed value is invalid or <= 0; optionally clamp to
a sensible maximum before passing into buildExtractTool to avoid runaway
extraction.
♻️ Duplicate comments (1)
backend/src/mastra/agents/populate.ts (1)

31-56: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Restore the orchestrator’s required tool surface (search_web, fetch_page, investigate_row).

This change moves orchestration to extract_rows/list_rows, which violates the required populate-agent contract and removes the mandated no-write boundary at the orchestrator layer.

As per coding guidelines, "Populate orchestrator agent must have exactly 3 tools: search_web, fetch_page, investigate_row with no write access".

Also applies to: 105-109

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/agents/populate.ts` around lines 31 - 56, The orchestrator
removed the mandated tool surface and must be restored to expose exactly the
three read-only tools: search_web, fetch_page, and investigate_row and must not
perform any write operations itself; revert or refactor the current
orchestration flow that moved responsibilities into extract_rows/list_rows so
that populate (or the populate-agent orchestrator function) only calls
search_web to get URL candidates, calls fetch_page to retrieve page content, and
calls investigate_row to analyze content and emit leads, and ensure
extract_rows/list_rows remain downstream workers (or are called by
investigate_row) rather than replacing the orchestrator tools; locate usages of
extract_rows, list_rows, and the populate-orchestrator function and change them
so the orchestrator exposes and calls the three tool functions (search_web,
fetch_page, investigate_row) and contains no writes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@backend/src/mastra/agents/populate.ts`:
- Around line 91-98: The targetRows env parsing is unsafe and can produce
NaN/0/negative values which disable extraction caps; update the parsing for
targetRows (the variable passed into buildExtractTool) to robustly parse and
validate the environment value (use parseInt/Number, ensure it's a finite
positive integer, floor it) and fall back to the default 20 when the parsed
value is invalid or <= 0; optionally clamp to a sensible maximum before passing
into buildExtractTool to avoid runaway extraction.

---

Duplicate comments:
In `@backend/src/mastra/agents/populate.ts`:
- Around line 31-56: The orchestrator removed the mandated tool surface and must
be restored to expose exactly the three read-only tools: search_web, fetch_page,
and investigate_row and must not perform any write operations itself; revert or
refactor the current orchestration flow that moved responsibilities into
extract_rows/list_rows so that populate (or the populate-agent orchestrator
function) only calls search_web to get URL candidates, calls fetch_page to
retrieve page content, and calls investigate_row to analyze content and emit
leads, and ensure extract_rows/list_rows remain downstream workers (or are
called by investigate_row) rather than replacing the orchestrator tools; locate
usages of extract_rows, list_rows, and the populate-orchestrator function and
change them so the orchestrator exposes and calls the three tool functions
(search_web, fetch_page, investigate_row) and contains no writes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b0a10655-2fca-4ffc-82f7-b3c3f0950c14

📥 Commits

Reviewing files that changed from the base of the PR and between 4fac5bc and 792737c.

📒 Files selected for processing (1)
  • backend/src/mastra/agents/populate.ts

MMeteorL and others added 2 commits May 25, 2026 13:00
- env.ts: validate BIGSET_POPULATE_TARGET_ROWS — NaN, zero, and
  negative values now fall back to the default of 20 instead of
  silently breaking the agent's stop condition.

- datasetRows.mergeUpdate (new Convex internalMutation): atomic
  per-field blank-aware merge. Blank cells are always filled with any
  non-empty incoming value regardless of confidence; non-blank cells are
  only overwritten when newConfidence > existing row confidence.
  The authoritative check lives in Convex so two concurrent investigate
  agents can't both pass a stale client-side confidence check and race
  to write — the compare-and-merge is serialized inside a single
  transaction. Quota is charged only on actual changes; history entries
  are recorded per changed field.

- investigate-tool.ts / buildUpdateRowByKeyTool: drop the row-wide
  confidence pre-check (which blocked filling blank columns on partial
  high-confidence rows) and call mergeUpdate instead of update. The
  local rowIndex is updated optimistically using the same per-field
  rules so subsequent calls within the run stay consistent without an
  extra Convex round-trip.

- datasetRows.deleteIncomplete: skip _-prefixed keys (_confidence,
  _sources, etc.) in the completeness check — they are internal fields
  that are never required columns.

- backend/CLAUDE.md: rewrite Mastra section to reflect the tri-agent
  architecture and new confidence/merge semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
One textual conflict in investigate-tool.ts: commit 1e9cc8b on main
("Unify local env loading") fixed a redundant `?? false` in the old
parseInvestigateResult return block. Our branch had already replaced
that entire function with parseTriageExtractOutput and
parseInvestigateOutput. Git's heuristic matched the `return { ... }`
structure across the two different functions and flagged a false
conflict. Resolution: keep HEAD (our new parsers are correct; the old
function no longer exists).

All other files merged cleanly:
- backend/src/env.ts: main's dotenv path consolidation + our
  POPULATE_TARGET_ROWS validation both land without conflict.
- frontend/convex/datasetRows.ts: main's rowCount denormalized counter
  (insert/remove/clearByDataset) merges alongside our mergeUpdate
  mutation and deleteIncomplete fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #85 ("Make populate status truthful") refactored the /populate route
from synchronous (awaiting the workflow inline) to async: it now claims
the dataset, creates a run, fires runPopulateWorkflowInBackground() as
a background void promise, and returns 202 immediately.

Our branch still had the old synchronous pattern with the post-workflow
notification and status-update logic inline in the route handler —
including our deleteIncomplete pruning addition. The conflict was the
entire old inline block vs main's single 202 return.

Resolution:
- Take main's 202 return — the async pattern is strictly better.
- Move the deleteIncomplete pruning into runPopulateWorkflowInBackground()
  (already introduced by main), placed before the countByDataset call so
  we count only complete rows when deciding whether to set status "live".

All other files (populate.ts workflow, frontend status/schema changes)
merged cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/src/env.ts (1)

37-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix POPULATE_TARGET_ROWS validation to reject non-integers before flooring.

Current logic accepts fractional values and can produce 0 (e.g., 0.5 -> 0), which can prematurely terminate populate runs. It also contradicts the comment that non-integers fallback to 20.

💡 Suggested patch
   POPULATE_TARGET_ROWS: (() => {
     const parsed = Number(process.env.BIGSET_POPULATE_TARGET_ROWS);
-    return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : 20;
+    return Number.isInteger(parsed) && parsed > 0 ? parsed : 20;
   })(),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/env.ts` around lines 37 - 40, POPULATE_TARGET_ROWS currently
accepts fractional numbers because it checks Number.isFinite(parsed) then
floors, which lets values like 0.5 become 0; change the validation so only
integer positive values are accepted (use Number.isInteger(parsed) && parsed >
0) and return parsed as-is (or Math.floor only if you want to coerce negatives
handled) otherwise fall back to 20; update the initialization for
POPULATE_TARGET_ROWS in env.ts to perform Number.isInteger(parsed) && parsed > 0
before using the value.
frontend/convex/datasetRows.ts (1)

334-360: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

deleteIncomplete does not update the cached rowCount, causing stale dashboard counts.

insert, remove, and clearByDataset all maintain the denormalized rowCount field on the dataset document. However, deleteIncomplete calls ctx.db.delete(row._id) directly without decrementing the counter. After pruning, the dashboard will display a higher row count than actually exists until the next mutation that self-heals.

🐛 Proposed fix: update rowCount after deleting rows
 export const deleteIncomplete = internalMutation({
   args: {
     datasetId: v.id("datasets"),
     columnNames: v.array(v.string()),
   },
   handler: async (ctx, args) => {
     const rows = await ctx.db
       .query("datasetRows")
       .withIndex("by_dataset", (q) => q.eq("datasetId", args.datasetId))
       .collect();

     let deletedCount = 0;
     for (const row of rows) {
       const data = row.data as Record<string, unknown>;
       const isComplete = args.columnNames.every((col) => {
         if (col.startsWith("_")) return true; // skip internal fields (_confidence, _sources, etc.)
         const val = data[col];
         return val !== null && val !== undefined && val !== "";
       });
       if (!isComplete) {
         await ctx.db.delete(row._id);
         deletedCount++;
       }
     }
+
+    // Update the cached rowCount to stay in sync with actual rows.
+    if (deletedCount > 0) {
+      const dataset = await ctx.db.get(args.datasetId);
+      if (dataset) {
+        const newCount =
+          typeof dataset.rowCount === "number"
+            ? Math.max(0, dataset.rowCount - deletedCount)
+            : rows.length - deletedCount; // self-heal if rowCount was missing
+        await ctx.db.patch(args.datasetId, { rowCount: newCount });
+      }
+    }
+
     return { deletedCount };
   },
 });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/convex/datasetRows.ts` around lines 334 - 360, The deleteIncomplete
mutation removes rows but never updates the dataset's denormalized rowCount,
leaving stale counts; modify deleteIncomplete to mirror the behavior of
insert/remove/clearByDataset by decrementing the dataset document's rowCount
after deletions (use the same DB update/patch API used by insert/remove to
subtract deletedCount for args.datasetId), or update the rowCount incrementally
as you delete each row so the dataset's rowCount stays consistent with the
actual rows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@backend/src/env.ts`:
- Around line 37-40: POPULATE_TARGET_ROWS currently accepts fractional numbers
because it checks Number.isFinite(parsed) then floors, which lets values like
0.5 become 0; change the validation so only integer positive values are accepted
(use Number.isInteger(parsed) && parsed > 0) and return parsed as-is (or
Math.floor only if you want to coerce negatives handled) otherwise fall back to
20; update the initialization for POPULATE_TARGET_ROWS in env.ts to perform
Number.isInteger(parsed) && parsed > 0 before using the value.

In `@frontend/convex/datasetRows.ts`:
- Around line 334-360: The deleteIncomplete mutation removes rows but never
updates the dataset's denormalized rowCount, leaving stale counts; modify
deleteIncomplete to mirror the behavior of insert/remove/clearByDataset by
decrementing the dataset document's rowCount after deletions (use the same DB
update/patch API used by insert/remove to subtract deletedCount for
args.datasetId), or update the rowCount incrementally as you delete each row so
the dataset's rowCount stays consistent with the actual rows.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fa06ba5d-6848-4387-adb5-d3e8897a9229

📥 Commits

Reviewing files that changed from the base of the PR and between befd505 and 6d9fd2d.

📒 Files selected for processing (5)
  • backend/package.json
  • backend/src/env.ts
  • backend/src/index.ts
  • backend/src/mastra/workflows/populate.ts
  • frontend/convex/datasetRows.ts
✅ Files skipped from review due to trivial changes (1)
  • backend/package.json

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants