adding an isolated benchmarking system, and minor documentation changes by MMeteorL · Pull Request #90 · tinyfish-io/bigset

MMeteorL · 2026-05-25T10:59:22Z

An isolated benchmark testing system for the current Mastra agent, based on Edward's implementation of benchmark testing on this branch: codex/mastra-populate-benchmark.

Made some minor documentation changes for running the system with npm install first in frontend/ folder, and adding a structural documentation for the current agent system.

coderabbitai · 2026-05-25T10:59:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ecdcd7ba-082d-40f4-ab37-978ffe5e7eb8

📥 Commits

Reviewing files that changed from the base of the PR and between 693d132 and 6592646.

📒 Files selected for processing (2)

.gitignore
README.md

✅ Files skipped from review due to trivial changes (2)

.gitignore
README.md

📝 Walkthrough

Walkthrough

This pull request adds a complete benchmarking infrastructure for evaluating dataset-collection agents. The changes span project configuration (gitignore, README), a core tracing library and stdout contract, two adapters (smoke and Mastra populate), prompt and answer-key datasets, a comprehensive run-benchmark CLI with scoring/rescoring and reporting, tests for failure/scoring logic, and architecture documentation describing the two-phase populate pipeline and orchestrator/subagent interactions.

Sequence Diagram(s)

sequenceDiagram
  participant Runner
  participant RunBenchmark
  participant OrchestratorAgent
  participant InvestigateAgent
  participant search_web
  participant fetch_page
  participant InMemoryDataset
  participant ArtifactWriter

  Runner->>RunBenchmark: start(system, prompt, env)
  RunBenchmark->>OrchestratorAgent: spawn orchestrator (prompt, tools, env)
  OrchestratorAgent->>InvestigateAgent: call investigate_row tool (context)
  InvestigateAgent->>search_web: perform search_web(query)
  InvestigateAgent->>fetch_page: perform fetch_page(url)
  InvestigateAgent->>InMemoryDataset: insert_row(row)
  InvestigateAgent->>ArtifactWriter: emit session trace (snapshot)
  OrchestratorAgent->>RunBenchmark: return payload (stdout or artifact)
  RunBenchmark->>ArtifactWriter: write parsed-output.json, stdout/stderr, summary

Possibly related PRs

tinyfish-io/bigset#81: The benchmark adapters implement and parse the investigate_row subagent workflow using the same INSERTED/SUMMARY/CLUES/REASON output contract that the backend two-tier populate agents introduced in PR #81.

Suggested reviewers

hwennnn
simantak-dabhade

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main changes: adding a benchmarking system and updating documentation. It aligns with the substantial code additions for benchmarks and README/docs updates.
Description check	✅ Passed	The description is related to the changeset, mentioning the benchmark testing system based on Edward's implementation and documentation changes for frontend npm install and agent documentation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch benchmark_testing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs`:
- Around line 355-369: The current insert path (using data -> cleanDataKeys ->
store.insert in this block) allows rows where every field is empty (e.g., "",
null, undefined) to be counted; add a validation after computing cleaned that
rejects the request when all values are empty: compute Object.values(cleaned)
and check that at least one value is "meaningful" (not null/undefined and not an
empty string/empty array), return { success: false, error: 'row must contain at
least one non-empty value' } instead of inserting, otherwise proceed to
store.insert and trace.recordInsert as before.
- Around line 35-99: The call to trace.initArtifacts runs before the try/finally
that ensures trace.restoreConsole() and trace.finalize() run, so failures there
leave console patched and artifacts unfinalized; move or wrap await
trace.initArtifacts({...}) inside the same try block that calls
trace.restoreConsole()/trace.finalize() (i.e., ensure createBenchmarkTrace(),
then enter try { await trace.initArtifacts(...); ... } finally {
trace.restoreConsole(); trace.finalize(); }) so any exception during
initArtifacts still triggers the cleanup.

In `@benchmarks/dataset-agent/README.md`:
- Around line 8-12: The README currently switches to the backend working
directory with "cd backend && npm ci" but the subsequent run command uses
root-relative paths ("node benchmarks/dataset-agent/run-benchmark.mjs" and the
adapter path "benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs"),
causing path resolution to fail; fix by making the working directory usage
consistent: either keep the "cd backend" step and update the run command to
reference the backend directory (prefix paths with "../" or "./backend/" as
appropriate) or remove the "cd backend" and run "npm ci" using "npm ci --prefix
backend" so the run command can stay root-relative—ensure the updated
instructions consistently reference the same cwd for the run-benchmark.mjs
command and the mastra-populate-adapter.mjs adapter path.

In `@benchmarks/dataset-agent/run-benchmark.mjs`:
- Around line 570-584: The current timeout handler only sends SIGTERM
(child.kill("SIGTERM")) which can hang if the child ignores it; update the
timeout logic in the setTimeout callback to send SIGTERM first, then schedule a
short secondary timer (grace period) that, if the child is still running, sends
SIGKILL (child.kill("SIGKILL")), and ensure both timers are cleared when the
child emits "close" so resolve({ stdout, stderr, exitCode, timedOut }) always
runs; reference the existing timeout variable, timedOut flag,
child.kill("SIGTERM") call, and the child.on("close", ...) resolver when
implementing this change.
- Around line 418-429: The config currently seeds targetContract with a spread
of defaultTargetContract which gets merged over entityBenchmarkContract in
resolveTargetContract and unintentionally relaxes stricter entity gates; change
the initialization in run-benchmark.mjs to not prepopulate config.targetContract
with {...defaultTargetContract} (use null/undefined or omit the property) so
resolveTargetContract can correctly apply entityBenchmarkContract and
defaultTargetContract in the intended order, or alternatively adjust
resolveTargetContract merge order to ensure entityBenchmarkContract overrides
defaults (refer to config.targetContract and the resolveTargetContract
function).
- Around line 1442-1447: rowHasAllowedDomain currently treats hostnames as
allowed if urlHostname(url).endsWith(allowedDomain), which allows lookalike
matches (e.g. notopenai.com). Fix by normalizing the hostname via
urlHostname(url) and matching either exact equality (hostname === allowedDomain)
or a subdomain boundary (hostname.endsWith('.' + allowedDomain)); update the
logic inside rowHasAllowedDomain that iterates rowSourceUrls and uses
urlHostname so it only returns true for exact domain or proper subdomain
matches.
- Around line 1260-1275: The current bare patterns in blockerPatterns
(especially "tinyfish_api_key" and "openrouter_api_key") cause false positives
when those strings appear in normal text; update those two entries to be more
specific by requiring word boundaries and an assignment/contextual indicator
(e.g., require \b around the key name and expect an assignment or value within a
short window like =, :, or quotes within ~0–80 chars). Edit the blockerPatterns
array (the tinyfish_api_key and openrouter_api_key entries) to replace the bare
literals with regexes such as requiring
\b(?:tinyfish_api_key|openrouter_api_key)\b followed by a nearby
assignment/value pattern so only actual credential-like occurrences trigger the
blocker.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8fb3815-8f14-4139-88fb-7ce86a7aba0d

📥 Commits

Reviewing files that changed from the base of the PR and between 1e9cc8b and 693d132.

📒 Files selected for processing (12)

.gitignore
README.md
benchmarks/dataset-agent/README.md
benchmarks/dataset-agent/adapters/.gitignore
benchmarks/dataset-agent/adapters/lib/mastra-benchmark-trace.mjs
benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs
benchmarks/dataset-agent/adapters/smoke-adapter.mjs
benchmarks/dataset-agent/answer-keys-entity.mjs
benchmarks/dataset-agent/prompts.json
benchmarks/dataset-agent/run-benchmark.mjs
benchmarks/dataset-agent/run-benchmark.test.mjs
docs/data-collection-agents.md

coderabbitai · 2026-05-25T11:08:50Z

+const trace = createBenchmarkTrace();
+const columns = requiredColumns.map((columnName) => ({
+  name: columnName,
+  type: inferPopulateColumnType(columnName),
+  description: `Benchmark requested column for ${promptQuality} prompt.`,
+}));
+const datasetName = `benchmark_${safeIdSegment(promptId)}`;
+const maxSteps = Number(process.env.BIGSET_MASTRA_BENCHMARK_MAX_STEPS ?? "80");
+const targetRows = Number(process.env.BIGSET_MASTRA_BENCHMARK_TARGET_ROWS ?? "20");
+
+const store = createRowStore(trace);
+const metrics = {
+  searchCallCount: 0,
+  fetchCallCount: 0,
+  browserCallCount: 0,
+  agentRunCount: 0,
+  agentStepCount: 0,
+};
+
+const authContext = {
+  authorizedUserId: "benchmark",
+  workflowRunId: `benchmark-${promptId}-${Date.now()}`,
+};
+const authorizedDatasetId = `benchmark-${safeIdSegment(promptId)}`;
+
+const openrouter = createOpenRouter({
+  apiKey: process.env.OPENROUTER_API_KEY,
+});
+
+const ORCHESTRATOR_INSTRUCTIONS = `You fill datasets by finding real leads and handing them to subagents for deep research.
+
+1. Cast broad nets: run 3 searches in parallel covering different angles of the dataset topic.
+   Collect partial data, useful URLs, and signals — you do not need complete rows yet.
+
+2. Hand off leads: call investigate_row for each promising lead.
+   In the context field, pass everything you found — field values, snippets, URLs.
+   - First batch: exactly 3 in parallel. Wait for all to finish and read every clue.
+   - Second batch: up to 10 in parallel. Wait for all to finish and read every clue.
+   - All subsequent batches: no limit — spawn as many as you have good leads.
+
+3. Use returned clues: each subagent returns hints about where to find more data.
+   Feed those clues into the next batch of investigate_row calls.
+
+4. Keep going until you have 20 inserted rows or have exhausted real leads.
+
+Do not insert rows yourself — only investigate_row subagents can write to the dataset.
+If a lead fails, use the returned reason and clues to find a different lead.`;
+
+const agentPrompt = buildPopulatePrompt();
+let validationIssues = [];
+let orchestratorError = null;
+
+await trace.initArtifacts({
+  promptId,
+  datasetName,
+  authorizedDatasetId,
+  userPrompt: prompt,
+  orchestratorPrompt: agentPrompt,
+  requiredColumns,
+  maxSteps,
+  targetRows,
+});
+
+trace.setPayloadSnapshot(() => buildBenchmarkPayload());
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard initArtifacts with the same try/finally lifecycle.

createBenchmarkTrace() patches console.log immediately, but await trace.initArtifacts(...) runs before the try/finally block. If artifact initialization fails, restoreConsole() and finalize() never run.

💡 Suggested control-flow fix

-const trace = createBenchmarkTrace(); +const trace = createBenchmarkTrace(); ... -await trace.initArtifacts({ - promptId, - datasetName, - authorizedDatasetId, - userPrompt: prompt, - orchestratorPrompt: agentPrompt, - requiredColumns, - maxSteps, - targetRows, -}); - -trace.setPayloadSnapshot(() => buildBenchmarkPayload()); - try { + await trace.initArtifacts({ + promptId, + datasetName, + authorizedDatasetId, + userPrompt: prompt, + orchestratorPrompt: agentPrompt, + requiredColumns, + maxSteps, + targetRows, + }); + trace.setPayloadSnapshot(() => buildBenchmarkPayload()); + const agent = buildPopulateAgent(); metrics.agentRunCount += 1; const startedAt = new Date().toISOString();

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs` around lines 35 - 99, The call to trace.initArtifacts runs before the try/finally that ensures trace.restoreConsole() and trace.finalize() run, so failures there leave console patched and artifacts unfinalized; move or wrap await trace.initArtifacts({...}) inside the same try block that calls trace.restoreConsole()/trace.finalize() (i.e., ensure createBenchmarkTrace(), then enter try { await trace.initArtifacts(...); ... } finally { trace.restoreConsole(); trace.finalize(); }) so any exception during initArtifacts still triggers the cleanup.

coderabbitai · 2026-05-25T11:08:50Z

+      if (!data || Object.keys(data).length === 0) {
+        return {
+          success: false,
+          error:
+            'data is required and must have at least one key. Pass an object like { "Column Name": value }.',
+        };
+      }
+      const cleaned = cleanDataKeys(data);
+      const row = store.insert(cleaned, meta);
+      await trace.recordInsert({
+        sessionId: meta.sessionId,
+        entityHint: meta.entityHint,
+        row,
+      });
+      return { success: true };


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject all-empty inserts to protect benchmark integrity.

Current validation allows { colA: "", colB: "" } (or null-only rows) to be inserted and counted. That can satisfy target row counts without meaningful data.

💡 Suggested validation patch

execute: async ({ data }) => { if (!data || Object.keys(data).length === 0) { return { success: false, error: 'data is required and must have at least one key. Pass an object like { "Column Name": value }.', }; } const cleaned = cleanDataKeys(data); + const hasNonEmptyValue = Object.values(cleaned).some( + (value) => value !== undefined && value !== null && String(value).trim() !== "" + ); + if (!hasNonEmptyValue) { + return { + success: false, + error: "at least one column must contain a non-empty value.", + }; + } const row = store.insert(cleaned, meta); await trace.recordInsert({ sessionId: meta.sessionId, entityHint: meta.entityHint, row, });

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (!data || Object.keys(data).length === 0) {

return {

success: false,

error:

'data is required and must have at least one key. Pass an object like { "Column Name": value }.',

};

}

const cleaned = cleanDataKeys(data);

const row = store.insert(cleaned, meta);

await trace.recordInsert({

sessionId: meta.sessionId,

entityHint: meta.entityHint,

row,

});

return { success: true };

if (!data || Object.keys(data).length === 0) {

return {

success: false,

error:

'data is required and must have at least one key. Pass an object like { "Column Name": value }.',

};

}

const cleaned = cleanDataKeys(data);

const hasNonEmptyValue = Object.values(cleaned).some(

(value) => value !== undefined && value !== null && String(value).trim() !== ""

);

if (!hasNonEmptyValue) {

return {

success: false,

error: "at least one column must contain a non-empty value.",

};

}

const row = store.insert(cleaned, meta);

await trace.recordInsert({

sessionId: meta.sessionId,

entityHint: meta.entityHint,

row,

});

return { success: true };

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs` around lines 355 - 369, The current insert path (using data -> cleanDataKeys -> store.insert in this block) allows rows where every field is empty (e.g., "", null, undefined) to be counted; add a validation after computing cleaned that rejects the request when all values are empty: compute Object.values(cleaned) and check that at least one value is "meaningful" (not null/undefined and not an empty string/empty array), return { success: false, error: 'row must contain at least one non-empty value' } instead of inserting, otherwise proceed to store.insert and trace.recordInsert as before.

coderabbitai · 2026-05-25T11:08:50Z

+cd backend && npm ci
+
+node benchmarks/dataset-agent/run-benchmark.mjs \
+  --system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix working-directory mismatch in run instructions.

Line 8 changes cwd to backend, but Lines 10-12 use root-relative paths. Running those snippets sequentially will fail path resolution.

💡 Suggested doc fix

-cd backend && npm ci +cd backend && npm ci && cd .. node benchmarks/dataset-agent/run-benchmark.mjs \ --system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

cd backend && npm ci

node benchmarks/dataset-agent/run-benchmark.mjs \

--system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'

```

cd backend && npm ci && cd ..

node benchmarks/dataset-agent/run-benchmark.mjs \

--system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/dataset-agent/README.md` around lines 8 - 12, The README currently switches to the backend working directory with "cd backend && npm ci" but the subsequent run command uses root-relative paths ("node benchmarks/dataset-agent/run-benchmark.mjs" and the adapter path "benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs"), causing path resolution to fail; fix by making the working directory usage consistent: either keep the "cd backend" step and update the run command to reference the backend directory (prefix paths with "../" or "./backend/" as appropriate) or remove the "cd backend" and run "npm ci" using "npm ci --prefix backend" so the run command can stay root-relative—ensure the updated instructions consistently reference the same cwd for the run-benchmark.mjs command and the mastra-populate-adapter.mjs adapter path.

coderabbitai · 2026-05-25T11:08:50Z

+  const config = {
+    promptsPath: defaultPromptsPath,
+    promptIds: null,
+    systems: [],
+    timeoutMs: 10 * 60 * 1000,
+    inputUsdPer1M: 0.05,
+    outputUsdPer1M: 0.5,
+    tinyFishAgentStepUsd: 0.015,
+    targetContract: { ...defaultTargetContract },
+    minRequiredCompleteness: defaultTargetContract.minRequiredCompleteness,
+    minFactualAccuracy: defaultTargetContract.minFactualAccuracy,
+  };


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Entity benchmark defaults are being unintentionally relaxed.

config.targetContract is initialized with open-ended defaults, then merged over entityBenchmarkContract in resolveTargetContract, which disables the stricter entity gates by default.

💡 Proposed fix

function parseArgs(args) { const config = { @@ - targetContract: { ...defaultTargetContract }, + // CLI overrides only; per-prompt defaults come from resolveTargetContract(). + targetContract: {}, minRequiredCompleteness: defaultTargetContract.minRequiredCompleteness, minFactualAccuracy: defaultTargetContract.minFactualAccuracy, };

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const config = {

promptsPath: defaultPromptsPath,

promptIds: null,

systems: [],

timeoutMs: 10 * 60 * 1000,

inputUsdPer1M: 0.05,

outputUsdPer1M: 0.5,

tinyFishAgentStepUsd: 0.015,

targetContract: { ...defaultTargetContract },

minRequiredCompleteness: defaultTargetContract.minRequiredCompleteness,

minFactualAccuracy: defaultTargetContract.minFactualAccuracy,

};

const config = {

promptsPath: defaultPromptsPath,

promptIds: null,

systems: [],

timeoutMs: 10 * 60 * 1000,

inputUsdPer1M: 0.05,

outputUsdPer1M: 0.5,

tinyFishAgentStepUsd: 0.015,

// CLI overrides only; per-prompt defaults come from resolveTargetContract().

targetContract: {},

minRequiredCompleteness: defaultTargetContract.minRequiredCompleteness,

minFactualAccuracy: defaultTargetContract.minFactualAccuracy,

};

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/dataset-agent/run-benchmark.mjs` around lines 418 - 429, The config currently seeds targetContract with a spread of defaultTargetContract which gets merged over entityBenchmarkContract in resolveTargetContract and unintentionally relaxes stricter entity gates; change the initialization in run-benchmark.mjs to not prepopulate config.targetContract with {...defaultTargetContract} (use null/undefined or omit the property) so resolveTargetContract can correctly apply entityBenchmarkContract and defaultTargetContract in the intended order, or alternatively adjust resolveTargetContract merge order to ensure entityBenchmarkContract overrides defaults (refer to config.targetContract and the resolveTargetContract function).

coderabbitai · 2026-05-25T11:08:50Z

+    const timeout = setTimeout(() => {
+      timedOut = true;
+      child.kill("SIGTERM");
+    }, timeoutMs);
+
+    child.stdout.on("data", (chunk) => {
+      stdout += chunk.toString();
+    });
+    child.stderr.on("data", (chunk) => {
+      stderr += chunk.toString();
+    });
+    child.on("close", (exitCode) => {
+      clearTimeout(timeout);
+      resolve({ stdout, stderr, exitCode: exitCode ?? 1, timedOut });
+    });


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timeout handling can hang when the child ignores SIGTERM.

After timeout, only SIGTERM is sent. If the process doesn’t exit, the promise can remain unresolved indefinitely.

💡 Proposed fix

function runCommand({ command, timeoutMs, env }) { return new Promise((resolve) => { @@ - const timeout = setTimeout(() => { + let forceKillTimeout; + const timeout = setTimeout(() => { timedOut = true; child.kill("SIGTERM"); + forceKillTimeout = setTimeout(() => { + child.kill("SIGKILL"); + }, 5_000); }, timeoutMs); @@ child.on("close", (exitCode) => { clearTimeout(timeout); + if (forceKillTimeout) clearTimeout(forceKillTimeout); resolve({ stdout, stderr, exitCode: exitCode ?? 1, timedOut }); }); }); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const timeout = setTimeout(() => {

timedOut = true;

child.kill("SIGTERM");

}, timeoutMs);

child.stdout.on("data", (chunk) => {

stdout += chunk.toString();

});

child.stderr.on("data", (chunk) => {

stderr += chunk.toString();

});

child.on("close", (exitCode) => {

clearTimeout(timeout);

resolve({ stdout, stderr, exitCode: exitCode ?? 1, timedOut });

});

let forceKillTimeout;

const timeout = setTimeout(() => {

timedOut = true;

child.kill("SIGTERM");

forceKillTimeout = setTimeout(() => {

child.kill("SIGKILL");

}, 5_000);

}, timeoutMs);

child.stdout.on("data", (chunk) => {

stdout += chunk.toString();

});

child.stderr.on("data", (chunk) => {

stderr += chunk.toString();

});

child.on("close", (exitCode) => {

clearTimeout(timeout);

if (forceKillTimeout) clearTimeout(forceKillTimeout);

resolve({ stdout, stderr, exitCode: exitCode ?? 1, timedOut });

});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/dataset-agent/run-benchmark.mjs` around lines 570 - 584, The current timeout handler only sends SIGTERM (child.kill("SIGTERM")) which can hang if the child ignores it; update the timeout logic in the setTimeout callback to send SIGTERM first, then schedule a short secondary timer (grace period) that, if the child is still running, sends SIGKILL (child.kill("SIGKILL")), and ensure both timers are cleared when the child emits "close" so resolve({ stdout, stderr, exitCode, timedOut }) always runs; reference the existing timeout variable, timedOut flag, child.kill("SIGTERM") call, and the child.on("close", ...) resolver when implementing this change.

coderabbitai · 2026-05-25T11:08:50Z

+  const blockerPatterns = [
+    /authentication failed/,
+    /active subscription/,
+    /insufficient credits/,
+    /not enough credits/,
+    /(?:missing|required|invalid|not configured|not set|unset)[^.]{0,80}api[_ -]?key/,
+    /api[_ -]?key[^.]{0,80}(?:missing|required|invalid|not configured|not set|unset)/,
+    /tinyfish_api_key/,
+    /openrouter_api_key/,
+    /quota exceeded/,
+    /rate[_ -]?limit[_ -]?exceeded/,
+    /benchmark deadline/,
+  ];
+  return blockerPatterns.some((pattern) => pattern.test(combinedText))
+    ? "Infrastructure/auth/credits blocker."
+    : null;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Infra blocker detection can false-positive on benign payload text.

Bare patterns for tinyfish_api_key / openrouter_api_key mark runs as blocked even when those strings appear in normal docs/output text.

💡 Proposed fix

const blockerPatterns = [ @@ - /tinyfish_api_key/, - /openrouter_api_key/, + /(?:missing|required|invalid|not configured|not set|unset)[^.]{0,80}(?:tinyfish_api_key|openrouter_api_key)/, + /(?:tinyfish_api_key|openrouter_api_key)[^.]{0,80}(?:missing|required|invalid|not configured|not set|unset)/, @@ ];

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/dataset-agent/run-benchmark.mjs` around lines 1260 - 1275, The current bare patterns in blockerPatterns (especially "tinyfish_api_key" and "openrouter_api_key") cause false positives when those strings appear in normal text; update those two entries to be more specific by requiring word boundaries and an assignment/contextual indicator (e.g., require \b around the key name and expect an assignment or value within a short window like =, :, or quotes within ~0–80 chars). Edit the blockerPatterns array (the tinyfish_api_key and openrouter_api_key entries) to replace the bare literals with regexes such as requiring \b(?:tinyfish_api_key|openrouter_api_key)\b followed by a nearby assignment/value pattern so only actual credential-like occurrences trigger the blocker.

coderabbitai · 2026-05-25T11:08:51Z

+function rowHasAllowedDomain(row, allowedDomains) {
+  if (!allowedDomains?.length) return true;
+  const cells = rowCells(row);
+  return rowSourceUrls(row, cells).some((url) =>
+    allowedDomains.some((allowedDomain) => urlHostname(url).endsWith(allowedDomain))
+  );


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Domain validation accepts lookalike hostnames.

Using endsWith(allowedDomain) allows false positives like notopenai.com matching openai.com.

💡 Proposed fix

function rowHasAllowedDomain(row, allowedDomains) { if (!allowedDomains?.length) return true; const cells = rowCells(row); return rowSourceUrls(row, cells).some((url) => - allowedDomains.some((allowedDomain) => urlHostname(url).endsWith(allowedDomain)) + allowedDomains.some((allowedDomain) => { + const host = urlHostname(url); + return host === allowedDomain || host.endsWith(`.${allowedDomain}`); + }) ); }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/dataset-agent/run-benchmark.mjs` around lines 1442 - 1447, rowHasAllowedDomain currently treats hostnames as allowed if urlHostname(url).endsWith(allowedDomain), which allows lookalike matches (e.g. notopenai.com). Fix by normalizing the hostname via urlHostname(url) and matching either exact equality (hostname === allowedDomain) or a subdomain boundary (hostname.endsWith('.' + allowedDomain)); update the logic inside rowHasAllowedDomain that iterates rowSourceUrls and uses urlHostname so it only returns true for exact domain or proper subdomain matches.

MMeteorL · 2026-05-25T17:53:38Z

The benchmark testing is just a pipelined replication of the agent behavior, instead of going through the same workflow. Thus using it as a reference to agent experimentation instead of precise agent performance is a better use for the testing.

adding an isolated benchmarking system, and minor documentation changes

693d132

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

Merge branch 'main' into benchmark_testing

6592646

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding an isolated benchmarking system, and minor documentation changes#90

adding an isolated benchmarking system, and minor documentation changes#90
MMeteorL wants to merge 2 commits into
mainfrom
benchmark_testing

MMeteorL commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026 •

edited

Loading

Walkthrough

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

coderabbitai Bot May 25, 2026

Uh oh!

MMeteorL commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MMeteorL commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

MMeteorL commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 25, 2026 •

edited

Loading