Data Collection Workflow Update #74

Open
MMeteorL wants to merge 34 commits into main from codex/collection-stack-merged-main

@MMeteorL
Collaborator

Improves agent performance by adding sourceUrl prioritization, parallel extraction agents, routes for the TinyFish agent and the Playwright agent, and a centralized memory for data that should be passed to the Playwright agent. Also updates benchmark testing to include open-ended user prompts that can produce up to the 100-row cap, or close to it. The data collection agent can now receive input from, and send output to, the frontend. See the documentation for details.
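As a rough illustration of the extraction parallelism mentioned above, bounded concurrency can be sketched as follows. This is a minimal sketch, not the code in this PR: `mapWithConcurrency` and its parameters are hypothetical names, and the actual limit and worker logic live in the changed files.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none are left.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => runLane()));
  return results; // same order as `items`
}
```

Results keep the input order regardless of which call finishes first, which makes it simpler to merge extracted rows deterministically afterwards.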

Simantak Dabhade and others added 30 commits May 21, 2026 21:07
Introduces the "Clear & Populate" flow: an AI agent (Claude Sonnet 4.6
via OpenRouter) searches the web using TinyFish APIs, fetches page
content, and inserts real data into datasets row by row.

Backend:
- Mastra populate workflow (clear rows → build prompt → run agent)
- Populate agent with 7 tools: 5 database CRUD (insert, list, get,
  update, delete) + 2 web (search_web via TinyFish Search API,
  fetch_page via TinyFish Fetch API)
- All tools return structured errors so the agent can self-correct
- Data keys are sanitized to strip stray quotes/backticks from LLM output
- Fetch responses capped at 15K chars to protect agent context window
- Convex client uses anyApi to avoid cross-project imports in Docker
- POST /populate route with Clerk JWT auth

Frontend:
- "Clear & Populate" button on dataset detail page
- API client function in lib/backend.ts
- Rows appear in realtime via Convex reactive queries

Convex:
- New internal functions: datasetRows.get (query) and datasetRows.remove
  (mutation) for single-row read/delete

Infra:
- TINYFISH_API_KEY wired through docker-compose.dev.yml to backend
  and mastra services

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
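The backend notes above mention sanitizing data keys and capping fetch responses at 15K chars. A minimal sketch of both ideas, assuming hypothetical helper names (`sanitizeKey`, `sanitizeRowKeys`, `capContent`) that are not taken from the PR:

```typescript
// Hypothetical helpers; the PR's actual implementation is not shown on this page.
const MAX_FETCH_CHARS = 15000; // cap stated in the commit message

// Strip quotes and backticks that an LLM may wrap around a column key.
function sanitizeKey(key: string): string {
  return key.trim().replace(/^["'`]+|["'`]+$/g, "").trim();
}

// Return a copy of the row with every key sanitized; values are untouched.
function sanitizeRowKeys(row: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const key of Object.keys(row)) {
    clean[sanitizeKey(key)] = row[key];
  }
  return clean;
}

// Truncate fetched page text so it cannot flood the agent's context window.
function capContent(text: string, limit: number = MAX_FETCH_CHARS): string {
  return text.length <= limit ? text : text.slice(0, limit);
}
```

Truncating by character count is a simple stand-in here; whether the real code cuts at a character, token, or paragraph boundary is not stated in the commit message.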
- Enforce dataset ownership on POST /populate by querying Convex for
  the dataset and comparing ownerId to req.auth.userId before running
  the workflow (fixes authz gap)
- Remove raw row payloads from insert_row/update_row logs, log column
  count instead to avoid PII leakage
- Add 30s AbortController timeouts to both TinyFish fetch calls in
  web-tools.ts so they can't hang indefinitely
- Align PopulateResult type (rows → result) to match actual backend
  response shape

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
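The 30s AbortController timeout described above could look roughly like this. It is a sketch under assumptions: `withTimeout` is a hypothetical wrapper, and the actual code in web-tools.ts may be structured differently.

```typescript
// Hypothetical helper; not the code from web-tools.ts.
const DEFAULT_TIMEOUT_MS = 30000; // 30s, as stated in the commit message

// Run an abortable operation and abort it if it exceeds `timeoutMs`.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // always clear the timer so it cannot keep the process alive
  }
}
```

A caller would pass the signal through to the request, for example `withTimeout((signal) => fetch(url, { signal }))`, where `url` is a placeholder. `fetch` rejects once the signal aborts, so the tool can return a structured error instead of hanging.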
Convex query for dataset lookup can throw on invalid IDs — wrapping
it in the existing try/catch ensures controlled 400 responses instead
of unhandled 500s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
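The ownership check and the try/catch described in the two commits above can be sketched together. Everything here is an assumption for illustration: `authorizePopulate`, the `Lookup` type, and the 403/404 status codes are hypothetical; the commit messages only confirm the ownerId comparison and the controlled 400 response.

```typescript
// Hypothetical types and names; only the ownerId check and the 400 are from the commits.
interface Dataset {
  ownerId: string;
}
type Lookup = (datasetId: string) => Promise<Dataset | null>;
interface AuthzResult {
  status: number;
  error?: string;
}

async function authorizePopulate(
  datasetId: string,
  userId: string,
  getDataset: Lookup,
): Promise<AuthzResult> {
  try {
    // The lookup may throw when the ID is malformed, so it stays inside the try.
    const dataset = await getDataset(datasetId);
    if (dataset === null) return { status: 404, error: "Dataset not found" }; // assumed
    if (dataset.ownerId !== userId) return { status: 403, error: "Forbidden" }; // assumed
    return { status: 200 };
  } catch (_err) {
    // Controlled 400 instead of an unhandled 500.
    return { status: 400, error: "Invalid dataset ID" };
  }
}
```

Injecting the lookup as a parameter keeps the sketch independent of the Convex client and makes the three failure paths easy to exercise in a unit test.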
MMeteorL and others added 3 commits May 22, 2026 17:57
Integrate #26 populate agent foundation while keeping the self-healing
HTTP path, source-backed Mastra agent instructions, and populate tools.

Co-authored-by: Cursor <cursoragent@cursor.com>
Branched from codex/collection-official-website-sources at the merge
commit that integrated main (PR #26).

Co-authored-by: Cursor <cursoragent@cursor.com>
…rioritization, parallelism, and option to run search with tinyfish agent.
@coderabbitai

coderabbitai Bot commented May 23, 2026

Important

Review skipped

Too many files!

This PR contains 167 files, which is 17 over the limit of 150.

To get a review, narrow the scope:
• coderabbit review --type committed # exclude uncommitted changes
• coderabbit review --dir # limit to a subdirectory
• coderabbit review --base # compare against a closer base

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ffa70225-339d-498f-814e-9eafdb880a31

📥 Commits

Reviewing files that changed from the base of the PR and between 9f8d5cd and f4111e8.

⛔ Files ignored due to path filters (1)
  • backend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (167)
  • .env.example
  • .gitignore
  • CLAUDE.md
  • backend/.env.example
  • backend/.gitignore
  • backend/BigSet_Data_Collection_Agent/docs/architecture.md
  • backend/BigSet_Data_Collection_Agent/docs/data-flow.md
  • backend/BigSet_Data_Collection_Agent/docs/v15-02-combined-triage-extract.md
  • backend/BigSet_Data_Collection_Agent/docs/v15-efficiency-planned.md
  • backend/BigSet_Data_Collection_Agent/src/acquisition/link-follow.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/agent-goal.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/benchmark-spec.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/dataset-spec.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/extract-from-agent.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/extract.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/repair-diagnosis.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/repair-queries.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/source-policy.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/source-triage.ts
  • backend/BigSet_Data_Collection_Agent/src/agents/triage-extract.ts
  • backend/BigSet_Data_Collection_Agent/src/config.ts
  • backend/BigSet_Data_Collection_Agent/src/coverage/analyze.ts
  • backend/BigSet_Data_Collection_Agent/src/export/csv-compiler.ts
  • backend/BigSet_Data_Collection_Agent/src/export/select-results.ts
  • backend/BigSet_Data_Collection_Agent/src/integrations/openrouter.ts
  • backend/BigSet_Data_Collection_Agent/src/integrations/tinyfish-agent.ts
  • backend/BigSet_Data_Collection_Agent/src/integrations/tinyfish.ts
  • backend/BigSet_Data_Collection_Agent/src/llm/complete-json.ts
  • backend/BigSet_Data_Collection_Agent/src/llm/provider.ts
  • backend/BigSet_Data_Collection_Agent/src/llm/usage.ts
  • backend/BigSet_Data_Collection_Agent/src/memory/fingerprint.ts
  • backend/BigSet_Data_Collection_Agent/src/memory/index.ts
  • backend/BigSet_Data_Collection_Agent/src/memory/scored-aggregates.ts
  • backend/BigSet_Data_Collection_Agent/src/memory/search-pagination.ts
  • backend/BigSet_Data_Collection_Agent/src/memory/store.ts
  • backend/BigSet_Data_Collection_Agent/src/memory/types.ts
  • backend/BigSet_Data_Collection_Agent/src/memory/workflow-memory.ts
  • backend/BigSet_Data_Collection_Agent/src/merge/records.ts
  • backend/BigSet_Data_Collection_Agent/src/models/quality.ts
  • backend/BigSet_Data_Collection_Agent/src/models/schemas.ts
  • backend/BigSet_Data_Collection_Agent/src/models/source-status.ts
  • backend/BigSet_Data_Collection_Agent/src/orchestrator/acquisition.ts
  • backend/BigSet_Data_Collection_Agent/src/orchestrator/pipeline.ts
  • backend/BigSet_Data_Collection_Agent/src/orchestrator/process-pages.ts
  • backend/BigSet_Data_Collection_Agent/src/orchestrator/repair-loop.ts
  • backend/BigSet_Data_Collection_Agent/src/quality/build-report.ts
  • backend/BigSet_Data_Collection_Agent/src/quality/field-confidence.ts
  • backend/BigSet_Data_Collection_Agent/src/quality/index.ts
  • backend/BigSet_Data_Collection_Agent/src/quality/score-record.ts
  • backend/BigSet_Data_Collection_Agent/src/queue/domain-throttle.ts
  • backend/BigSet_Data_Collection_Agent/src/queue/pools.ts
  • backend/BigSet_Data_Collection_Agent/src/queue/rate-limiter.ts
  • backend/BigSet_Data_Collection_Agent/src/queue/retry.ts
  • backend/BigSet_Data_Collection_Agent/src/queue/task-queue.ts
  • backend/BigSet_Data_Collection_Agent/src/records/source-urls.ts
  • backend/BigSet_Data_Collection_Agent/src/storage/run-loader.ts
  • backend/BigSet_Data_Collection_Agent/src/storage/run-store.ts
  • backend/BigSet_Data_Collection_Agent/src/utils/concurrency.ts
  • backend/BigSet_Data_Collection_Agent/src/utils/url.ts
  • backend/CLAUDE.md
  • backend/docs/playwright-agent-integration.md
  • backend/docs/populate-collection-architecture.md
  • backend/docs/tinyfish-emitted-process-capture.md
  • backend/package.json
  • backend/prompts/schema-inference.txt
  • backend/src/convex.ts
  • backend/src/env.ts
  • backend/src/index.ts
  • backend/src/mastra/agents/populate-tools.ts
  • backend/src/mastra/agents/populate.ts
  • backend/src/mastra/agents/search-acquisition.ts
  • backend/src/mastra/index.ts
  • backend/src/mastra/tools/web-tools.ts
  • backend/src/mastra/workflows/populate.ts
  • backend/src/openrouter-models.ts
  • backend/src/pipeline/collection-agent-runner.ts
  • backend/src/pipeline/collection-memory/fingerprint.ts
  • backend/src/pipeline/collection-memory/index.ts
  • backend/src/pipeline/collection-memory/mutations.ts
  • backend/src/pipeline/collection-memory/service.ts
  • backend/src/pipeline/collection-memory/store.ts
  • backend/src/pipeline/collection-memory/types.ts
  • backend/src/pipeline/llm-usage.ts
  • backend/src/pipeline/populate-acquisition-prompt.ts
  • backend/src/pipeline/populate-acquisition.ts
  • backend/src/pipeline/populate-benchmark-debug.ts
  • backend/src/pipeline/populate-browser-agent.ts
  • backend/src/pipeline/populate-collection-memory-config.ts
  • backend/src/pipeline/populate-collection-runtime.ts
  • backend/src/pipeline/populate-convex-writer.ts
  • backend/src/pipeline/populate-dataset-context-loader.ts
  • backend/src/pipeline/populate-extract-from-agent.ts
  • backend/src/pipeline/populate-extract-records.ts
  • backend/src/pipeline/populate-extraction-spec.ts
  • backend/src/pipeline/populate-llm-json.ts
  • backend/src/pipeline/populate-merge-rows.ts
  • backend/src/pipeline/populate-normalize-dataset-keys.ts
  • backend/src/pipeline/populate-parallel-config.ts
  • backend/src/pipeline/populate-parallel.ts
  • backend/src/pipeline/populate-playwright-agent.ts
  • backend/src/pipeline/populate-prompt.ts
  • backend/src/pipeline/populate-row.ts
  • backend/src/pipeline/populate-runtime-limits.ts
  • backend/src/pipeline/populate-runtime-prerequisites.ts
  • backend/src/pipeline/populate-runtime-selection.ts
  • backend/src/pipeline/populate-runtime.ts
  • backend/src/pipeline/populate-search-prioritization.ts
  • backend/src/pipeline/populate-self-healing-cli.ts
  • backend/src/pipeline/populate-self-healing-command.ts
  • backend/src/pipeline/populate-self-healing-runner.ts
  • backend/src/pipeline/populate-self-healing.ts
  • backend/src/pipeline/populate-source-status.ts
  • backend/src/pipeline/populate-tinyfish-agent.ts
  • backend/src/pipeline/populate-triage-extract.ts
  • backend/src/pipeline/populate-url-utils.ts
  • backend/src/pipeline/populate-web-types.ts
  • backend/src/pipeline/schema-inference.ts
  • backend/src/pipeline/types.ts
  • backend/src/server.ts
  • backend/test/collection-agent-runner.test.ts
  • backend/test/collection-extract-finalize.test.ts
  • backend/test/collection-memory.test.ts
  • backend/test/collection-record-merge.test.ts
  • backend/test/collection-source-policy.test.ts
  • backend/test/llm-usage.test.ts
  • backend/test/populate-acquisition-prompt.test.ts
  • backend/test/populate-acquisition.test.ts
  • backend/test/populate-benchmark-debug.test.ts
  • backend/test/populate-collection-runtime.test.ts
  • backend/test/populate-convex-writer.test.ts
  • backend/test/populate-dataset-context-loader.test.ts
  • backend/test/populate-extract-records.test.ts
  • backend/test/populate-normalize-dataset-keys.test.ts
  • backend/test/populate-parallel.test.ts
  • backend/test/populate-runtime-limits.test.ts
  • backend/test/populate-runtime-prerequisites.test.ts
  • backend/test/populate-runtime-selection.test.ts
  • backend/test/populate-runtime.test.ts
  • backend/test/populate-self-healing-command.test.ts
  • backend/test/populate-self-healing-runner.test.ts
  • backend/test/populate-self-healing.test.ts
  • backend/test/populate-server.test.ts
  • backend/test/populate-test-hooks.ts
  • backend/test/schema-inference.test.ts
  • benchmarks/dataset-agent/README.md
  • benchmarks/dataset-agent/adapters/.gitignore
  • benchmarks/dataset-agent/adapters/collection-self-healing-adapter.mjs
  • benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs
  • benchmarks/dataset-agent/adapters/smoke-adapter.mjs
  • benchmarks/dataset-agent/adapters/template-adapter.mjs
  • benchmarks/dataset-agent/answer-keys-entity.mjs
  • benchmarks/dataset-agent/prompts.json
  • benchmarks/dataset-agent/run-benchmark.mjs
  • benchmarks/dataset-agent/run-benchmark.test.mjs
  • docker-compose.dev.yml
  • docs/branch-lineage.md
  • docs/data-collection-agent-migration-plan.md
  • frontend/.gitignore
  • frontend/AGENTS.md
  • frontend/CLAUDE.md
  • frontend/app/dataset/new/page.tsx
  • frontend/convex/datasetRows.ts
  • frontend/convex/datasets.ts
  • frontend/lib/backend.ts
  • frontend/skills-lock.json
  • makefiles/Makefile
  • scripts/verify-self-healing-stack.sh

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@MMeteorL MMeteorL changed the title from "Codex/collection stack merged main" to "Data Collection Workflow Update" on May 23, 2026
2 participants