Skip to content

adding an isolated benchmarking system, and minor documentation changes#90

Open
MMeteorL wants to merge 2 commits into
mainfrom
benchmark_testing
Open

adding an isolated benchmarking system, and minor documentation changes#90
MMeteorL wants to merge 2 commits into
mainfrom
benchmark_testing

Conversation

@MMeteorL
Copy link
Copy Markdown
Collaborator

An isolated benchmark testing system for the current Mastra agent, based on Edward's implementation of benchmark testing on this branch: codex/mastra-populate-benchmark.

Made some minor documentation changes for running the system with npm install first in frontend/ folder, and adding a structural documentation for the current agent system.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ecdcd7ba-082d-40f4-ab37-978ffe5e7eb8

📥 Commits

Reviewing files that changed from the base of the PR and between 693d132 and 6592646.

📒 Files selected for processing (2)
  • .gitignore
  • README.md
✅ Files skipped from review due to trivial changes (2)
  • .gitignore
  • README.md

📝 Walkthrough

Walkthrough

This pull request adds a complete benchmarking infrastructure for evaluating dataset-collection agents. The changes span project configuration (gitignore, README), a core tracing library and stdout contract, two adapters (smoke and Mastra populate), prompt and answer-key datasets, a comprehensive run-benchmark CLI with scoring/rescoring and reporting, tests for failure/scoring logic, and architecture documentation describing the two-phase populate pipeline and orchestrator/subagent interactions.

Sequence Diagram(s)

sequenceDiagram
  participant Runner
  participant RunBenchmark
  participant OrchestratorAgent
  participant InvestigateAgent
  participant search_web
  participant fetch_page
  participant InMemoryDataset
  participant ArtifactWriter

  Runner->>RunBenchmark: start(system, prompt, env)
  RunBenchmark->>OrchestratorAgent: spawn orchestrator (prompt, tools, env)
  OrchestratorAgent->>InvestigateAgent: call investigate_row tool (context)
  InvestigateAgent->>search_web: perform search_web(query)
  InvestigateAgent->>fetch_page: perform fetch_page(url)
  InvestigateAgent->>InMemoryDataset: insert_row(row)
  InvestigateAgent->>ArtifactWriter: emit session trace (snapshot)
  OrchestratorAgent->>RunBenchmark: return payload (stdout or artifact)
  RunBenchmark->>ArtifactWriter: write parsed-output.json, stdout/stderr, summary
Loading

Possibly related PRs

  • tinyfish-io/bigset#81: The benchmark adapters implement and parse the investigate_row subagent workflow using the same INSERTED/SUMMARY/CLUES/REASON output contract that the backend two-tier populate agents introduced in PR #81.

Suggested reviewers

  • hwennnn
  • simantak-dabhade
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: adding a benchmarking system and updating documentation. It aligns with the substantial code additions for benchmarks and README/docs updates.
Description check ✅ Passed The description is related to the changeset, mentioning the benchmark testing system based on Edward's implementation and documentation changes for frontend npm install and agent documentation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch benchmark_testing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs`:
- Around line 355-369: The current insert path (using data -> cleanDataKeys ->
store.insert in this block) allows rows where every field is empty (e.g., "",
null, undefined) to be counted; add a validation after computing cleaned that
rejects the request when all values are empty: compute Object.values(cleaned)
and check that at least one value is "meaningful" (not null/undefined and not an
empty string/empty array), return { success: false, error: 'row must contain at
least one non-empty value' } instead of inserting, otherwise proceed to
store.insert and trace.recordInsert as before.
- Around line 35-99: The call to trace.initArtifacts runs before the try/finally
that ensures trace.restoreConsole() and trace.finalize() run, so failures there
leave console patched and artifacts unfinalized; move or wrap await
trace.initArtifacts({...}) inside the same try block that calls
trace.restoreConsole()/trace.finalize() (i.e., ensure createBenchmarkTrace(),
then enter try { await trace.initArtifacts(...); ... } finally {
trace.restoreConsole(); trace.finalize(); }) so any exception during
initArtifacts still triggers the cleanup.

In `@benchmarks/dataset-agent/README.md`:
- Around line 8-12: The README currently switches to the backend working
directory with "cd backend && npm ci" but the subsequent run command uses
root-relative paths ("node benchmarks/dataset-agent/run-benchmark.mjs" and the
adapter path "benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs"),
causing path resolution to fail; fix by making the working directory usage
consistent: either keep the "cd backend" step and update the run command to
reference the backend directory (prefix paths with "../" or "./backend/" as
appropriate) or remove the "cd backend" and run "npm ci" using "npm ci --prefix
backend" so the run command can stay root-relative—ensure the updated
instructions consistently reference the same cwd for the run-benchmark.mjs
command and the mastra-populate-adapter.mjs adapter path.

In `@benchmarks/dataset-agent/run-benchmark.mjs`:
- Around line 570-584: The current timeout handler only sends SIGTERM
(child.kill("SIGTERM")) which can hang if the child ignores it; update the
timeout logic in the setTimeout callback to send SIGTERM first, then schedule a
short secondary timer (grace period) that, if the child is still running, sends
SIGKILL (child.kill("SIGKILL")), and ensure both timers are cleared when the
child emits "close" so resolve({ stdout, stderr, exitCode, timedOut }) always
runs; reference the existing timeout variable, timedOut flag,
child.kill("SIGTERM") call, and the child.on("close", ...) resolver when
implementing this change.
- Around line 418-429: The config currently seeds targetContract with a spread
of defaultTargetContract which gets merged over entityBenchmarkContract in
resolveTargetContract and unintentionally relaxes stricter entity gates; change
the initialization in run-benchmark.mjs to not prepopulate config.targetContract
with {...defaultTargetContract} (use null/undefined or omit the property) so
resolveTargetContract can correctly apply entityBenchmarkContract and
defaultTargetContract in the intended order, or alternatively adjust
resolveTargetContract merge order to ensure entityBenchmarkContract overrides
defaults (refer to config.targetContract and the resolveTargetContract
function).
- Around line 1442-1447: rowHasAllowedDomain currently treats hostnames as
allowed if urlHostname(url).endsWith(allowedDomain), which allows lookalike
matches (e.g. notopenai.com). Fix by normalizing the hostname via
urlHostname(url) and matching either exact equality (hostname === allowedDomain)
or a subdomain boundary (hostname.endsWith('.' + allowedDomain)); update the
logic inside rowHasAllowedDomain that iterates rowSourceUrls and uses
urlHostname so it only returns true for exact domain or proper subdomain
matches.
- Around line 1260-1275: The current bare patterns in blockerPatterns
(especially "tinyfish_api_key" and "openrouter_api_key") cause false positives
when those strings appear in normal text; update those two entries to be more
specific by requiring word boundaries and an assignment/contextual indicator
(e.g., require \b around the key name and expect an assignment or value within a
short window like =, :, or quotes within ~0–80 chars). Edit the blockerPatterns
array (the tinyfish_api_key and openrouter_api_key entries) to replace the bare
literals with regexes such as requiring
\b(?:tinyfish_api_key|openrouter_api_key)\b followed by a nearby
assignment/value pattern so only actual credential-like occurrences trigger the
blocker.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8fb3815-8f14-4139-88fb-7ce86a7aba0d

📥 Commits

Reviewing files that changed from the base of the PR and between 1e9cc8b and 693d132.

📒 Files selected for processing (12)
  • .gitignore
  • README.md
  • benchmarks/dataset-agent/README.md
  • benchmarks/dataset-agent/adapters/.gitignore
  • benchmarks/dataset-agent/adapters/lib/mastra-benchmark-trace.mjs
  • benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs
  • benchmarks/dataset-agent/adapters/smoke-adapter.mjs
  • benchmarks/dataset-agent/answer-keys-entity.mjs
  • benchmarks/dataset-agent/prompts.json
  • benchmarks/dataset-agent/run-benchmark.mjs
  • benchmarks/dataset-agent/run-benchmark.test.mjs
  • docs/data-collection-agents.md

Comment on lines +35 to +99
const trace = createBenchmarkTrace();
const columns = requiredColumns.map((columnName) => ({
name: columnName,
type: inferPopulateColumnType(columnName),
description: `Benchmark requested column for ${promptQuality} prompt.`,
}));
const datasetName = `benchmark_${safeIdSegment(promptId)}`;
const maxSteps = Number(process.env.BIGSET_MASTRA_BENCHMARK_MAX_STEPS ?? "80");
const targetRows = Number(process.env.BIGSET_MASTRA_BENCHMARK_TARGET_ROWS ?? "20");

const store = createRowStore(trace);
const metrics = {
searchCallCount: 0,
fetchCallCount: 0,
browserCallCount: 0,
agentRunCount: 0,
agentStepCount: 0,
};

const authContext = {
authorizedUserId: "benchmark",
workflowRunId: `benchmark-${promptId}-${Date.now()}`,
};
const authorizedDatasetId = `benchmark-${safeIdSegment(promptId)}`;

const openrouter = createOpenRouter({
apiKey: process.env.OPENROUTER_API_KEY,
});

const ORCHESTRATOR_INSTRUCTIONS = `You fill datasets by finding real leads and handing them to subagents for deep research.

1. Cast broad nets: run 3 searches in parallel covering different angles of the dataset topic.
Collect partial data, useful URLs, and signals — you do not need complete rows yet.

2. Hand off leads: call investigate_row for each promising lead.
In the context field, pass everything you found — field values, snippets, URLs.
- First batch: exactly 3 in parallel. Wait for all to finish and read every clue.
- Second batch: up to 10 in parallel. Wait for all to finish and read every clue.
- All subsequent batches: no limit — spawn as many as you have good leads.

3. Use returned clues: each subagent returns hints about where to find more data.
Feed those clues into the next batch of investigate_row calls.

4. Keep going until you have 20 inserted rows or have exhausted real leads.

Do not insert rows yourself — only investigate_row subagents can write to the dataset.
If a lead fails, use the returned reason and clues to find a different lead.`;

const agentPrompt = buildPopulatePrompt();
let validationIssues = [];
let orchestratorError = null;

await trace.initArtifacts({
promptId,
datasetName,
authorizedDatasetId,
userPrompt: prompt,
orchestratorPrompt: agentPrompt,
requiredColumns,
maxSteps,
targetRows,
});

trace.setPayloadSnapshot(() => buildBenchmarkPayload());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard initArtifacts with the same try/finally lifecycle.

createBenchmarkTrace() patches console.log immediately, but await trace.initArtifacts(...) runs before the try/finally block. If artifact initialization fails, restoreConsole() and finalize() never run.

💡 Suggested control-flow fix
-const trace = createBenchmarkTrace();
+const trace = createBenchmarkTrace();
 ...
-await trace.initArtifacts({
-  promptId,
-  datasetName,
-  authorizedDatasetId,
-  userPrompt: prompt,
-  orchestratorPrompt: agentPrompt,
-  requiredColumns,
-  maxSteps,
-  targetRows,
-});
-
-trace.setPayloadSnapshot(() => buildBenchmarkPayload());
-
 try {
+  await trace.initArtifacts({
+    promptId,
+    datasetName,
+    authorizedDatasetId,
+    userPrompt: prompt,
+    orchestratorPrompt: agentPrompt,
+    requiredColumns,
+    maxSteps,
+    targetRows,
+  });
+  trace.setPayloadSnapshot(() => buildBenchmarkPayload());
+
   const agent = buildPopulateAgent();
   metrics.agentRunCount += 1;
   const startedAt = new Date().toISOString();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs` around lines
35 - 99, The call to trace.initArtifacts runs before the try/finally that
ensures trace.restoreConsole() and trace.finalize() run, so failures there leave
console patched and artifacts unfinalized; move or wrap await
trace.initArtifacts({...}) inside the same try block that calls
trace.restoreConsole()/trace.finalize() (i.e., ensure createBenchmarkTrace(),
then enter try { await trace.initArtifacts(...); ... } finally {
trace.restoreConsole(); trace.finalize(); }) so any exception during
initArtifacts still triggers the cleanup.

Comment on lines +355 to +369
if (!data || Object.keys(data).length === 0) {
return {
success: false,
error:
'data is required and must have at least one key. Pass an object like { "Column Name": value }.',
};
}
const cleaned = cleanDataKeys(data);
const row = store.insert(cleaned, meta);
await trace.recordInsert({
sessionId: meta.sessionId,
entityHint: meta.entityHint,
row,
});
return { success: true };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject all-empty inserts to protect benchmark integrity.

Current validation allows { colA: "", colB: "" } (or null-only rows) to be inserted and counted. That can satisfy target row counts without meaningful data.

💡 Suggested validation patch
     execute: async ({ data }) => {
       if (!data || Object.keys(data).length === 0) {
         return {
           success: false,
           error:
             'data is required and must have at least one key. Pass an object like { "Column Name": value }.',
         };
       }
       const cleaned = cleanDataKeys(data);
+      const hasNonEmptyValue = Object.values(cleaned).some(
+        (value) => value !== undefined && value !== null && String(value).trim() !== ""
+      );
+      if (!hasNonEmptyValue) {
+        return {
+          success: false,
+          error: "at least one column must contain a non-empty value.",
+        };
+      }
       const row = store.insert(cleaned, meta);
       await trace.recordInsert({
         sessionId: meta.sessionId,
         entityHint: meta.entityHint,
         row,
       });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (!data || Object.keys(data).length === 0) {
return {
success: false,
error:
'data is required and must have at least one key. Pass an object like { "Column Name": value }.',
};
}
const cleaned = cleanDataKeys(data);
const row = store.insert(cleaned, meta);
await trace.recordInsert({
sessionId: meta.sessionId,
entityHint: meta.entityHint,
row,
});
return { success: true };
if (!data || Object.keys(data).length === 0) {
return {
success: false,
error:
'data is required and must have at least one key. Pass an object like { "Column Name": value }.',
};
}
const cleaned = cleanDataKeys(data);
const hasNonEmptyValue = Object.values(cleaned).some(
(value) => value !== undefined && value !== null && String(value).trim() !== ""
);
if (!hasNonEmptyValue) {
return {
success: false,
error: "at least one column must contain a non-empty value.",
};
}
const row = store.insert(cleaned, meta);
await trace.recordInsert({
sessionId: meta.sessionId,
entityHint: meta.entityHint,
row,
});
return { success: true };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs` around lines
355 - 369, The current insert path (using data -> cleanDataKeys -> store.insert
in this block) allows rows where every field is empty (e.g., "", null,
undefined) to be counted; add a validation after computing cleaned that rejects
the request when all values are empty: compute Object.values(cleaned) and check
that at least one value is "meaningful" (not null/undefined and not an empty
string/empty array), return { success: false, error: 'row must contain at least
one non-empty value' } instead of inserting, otherwise proceed to store.insert
and trace.recordInsert as before.

Comment on lines +8 to +12
cd backend && npm ci

node benchmarks/dataset-agent/run-benchmark.mjs \
--system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix working-directory mismatch in run instructions.

Line 8 changes cwd to backend, but Lines 10-12 use root-relative paths. Running those snippets sequentially will fail path resolution.

💡 Suggested doc fix
-cd backend && npm ci
+cd backend && npm ci && cd ..
 
 node benchmarks/dataset-agent/run-benchmark.mjs \
   --system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cd backend && npm ci
node benchmarks/dataset-agent/run-benchmark.mjs \
--system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'
```
cd backend && npm ci && cd ..
node benchmarks/dataset-agent/run-benchmark.mjs \
--system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/dataset-agent/README.md` around lines 8 - 12, The README currently
switches to the backend working directory with "cd backend && npm ci" but the
subsequent run command uses root-relative paths ("node
benchmarks/dataset-agent/run-benchmark.mjs" and the adapter path
"benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs"), causing path
resolution to fail; fix by making the working directory usage consistent: either
keep the "cd backend" step and update the run command to reference the backend
directory (prefix paths with "../" or "./backend/" as appropriate) or remove the
"cd backend" and run "npm ci" using "npm ci --prefix backend" so the run command
can stay root-relative—ensure the updated instructions consistently reference
the same cwd for the run-benchmark.mjs command and the
mastra-populate-adapter.mjs adapter path.

Comment on lines +418 to +429
const config = {
promptsPath: defaultPromptsPath,
promptIds: null,
systems: [],
timeoutMs: 10 * 60 * 1000,
inputUsdPer1M: 0.05,
outputUsdPer1M: 0.5,
tinyFishAgentStepUsd: 0.015,
targetContract: { ...defaultTargetContract },
minRequiredCompleteness: defaultTargetContract.minRequiredCompleteness,
minFactualAccuracy: defaultTargetContract.minFactualAccuracy,
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Entity benchmark defaults are being unintentionally relaxed.

config.targetContract is initialized with open-ended defaults, then merged over entityBenchmarkContract in resolveTargetContract, which disables the stricter entity gates by default.

💡 Proposed fix
 function parseArgs(args) {
   const config = {
@@
-    targetContract: { ...defaultTargetContract },
+    // CLI overrides only; per-prompt defaults come from resolveTargetContract().
+    targetContract: {},
     minRequiredCompleteness: defaultTargetContract.minRequiredCompleteness,
     minFactualAccuracy: defaultTargetContract.minFactualAccuracy,
   };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const config = {
promptsPath: defaultPromptsPath,
promptIds: null,
systems: [],
timeoutMs: 10 * 60 * 1000,
inputUsdPer1M: 0.05,
outputUsdPer1M: 0.5,
tinyFishAgentStepUsd: 0.015,
targetContract: { ...defaultTargetContract },
minRequiredCompleteness: defaultTargetContract.minRequiredCompleteness,
minFactualAccuracy: defaultTargetContract.minFactualAccuracy,
};
const config = {
promptsPath: defaultPromptsPath,
promptIds: null,
systems: [],
timeoutMs: 10 * 60 * 1000,
inputUsdPer1M: 0.05,
outputUsdPer1M: 0.5,
tinyFishAgentStepUsd: 0.015,
// CLI overrides only; per-prompt defaults come from resolveTargetContract().
targetContract: {},
minRequiredCompleteness: defaultTargetContract.minRequiredCompleteness,
minFactualAccuracy: defaultTargetContract.minFactualAccuracy,
};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/dataset-agent/run-benchmark.mjs` around lines 418 - 429, The
config currently seeds targetContract with a spread of defaultTargetContract
which gets merged over entityBenchmarkContract in resolveTargetContract and
unintentionally relaxes stricter entity gates; change the initialization in
run-benchmark.mjs to not prepopulate config.targetContract with
{...defaultTargetContract} (use null/undefined or omit the property) so
resolveTargetContract can correctly apply entityBenchmarkContract and
defaultTargetContract in the intended order, or alternatively adjust
resolveTargetContract merge order to ensure entityBenchmarkContract overrides
defaults (refer to config.targetContract and the resolveTargetContract
function).

Comment on lines +570 to +584
const timeout = setTimeout(() => {
timedOut = true;
child.kill("SIGTERM");
}, timeoutMs);

child.stdout.on("data", (chunk) => {
stdout += chunk.toString();
});
child.stderr.on("data", (chunk) => {
stderr += chunk.toString();
});
child.on("close", (exitCode) => {
clearTimeout(timeout);
resolve({ stdout, stderr, exitCode: exitCode ?? 1, timedOut });
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timeout handling can hang when the child ignores SIGTERM.

After timeout, only SIGTERM is sent. If the process doesn’t exit, the promise can remain unresolved indefinitely.

💡 Proposed fix
 function runCommand({ command, timeoutMs, env }) {
   return new Promise((resolve) => {
@@
-    const timeout = setTimeout(() => {
+    let forceKillTimeout;
+    const timeout = setTimeout(() => {
       timedOut = true;
       child.kill("SIGTERM");
+      forceKillTimeout = setTimeout(() => {
+        child.kill("SIGKILL");
+      }, 5_000);
     }, timeoutMs);
@@
     child.on("close", (exitCode) => {
       clearTimeout(timeout);
+      if (forceKillTimeout) clearTimeout(forceKillTimeout);
       resolve({ stdout, stderr, exitCode: exitCode ?? 1, timedOut });
     });
   });
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const timeout = setTimeout(() => {
timedOut = true;
child.kill("SIGTERM");
}, timeoutMs);
child.stdout.on("data", (chunk) => {
stdout += chunk.toString();
});
child.stderr.on("data", (chunk) => {
stderr += chunk.toString();
});
child.on("close", (exitCode) => {
clearTimeout(timeout);
resolve({ stdout, stderr, exitCode: exitCode ?? 1, timedOut });
});
let forceKillTimeout;
const timeout = setTimeout(() => {
timedOut = true;
child.kill("SIGTERM");
forceKillTimeout = setTimeout(() => {
child.kill("SIGKILL");
}, 5_000);
}, timeoutMs);
child.stdout.on("data", (chunk) => {
stdout += chunk.toString();
});
child.stderr.on("data", (chunk) => {
stderr += chunk.toString();
});
child.on("close", (exitCode) => {
clearTimeout(timeout);
if (forceKillTimeout) clearTimeout(forceKillTimeout);
resolve({ stdout, stderr, exitCode: exitCode ?? 1, timedOut });
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/dataset-agent/run-benchmark.mjs` around lines 570 - 584, The
current timeout handler only sends SIGTERM (child.kill("SIGTERM")) which can
hang if the child ignores it; update the timeout logic in the setTimeout
callback to send SIGTERM first, then schedule a short secondary timer (grace
period) that, if the child is still running, sends SIGKILL
(child.kill("SIGKILL")), and ensure both timers are cleared when the child emits
"close" so resolve({ stdout, stderr, exitCode, timedOut }) always runs;
reference the existing timeout variable, timedOut flag, child.kill("SIGTERM")
call, and the child.on("close", ...) resolver when implementing this change.

Comment on lines +1260 to +1275
const blockerPatterns = [
/authentication failed/,
/active subscription/,
/insufficient credits/,
/not enough credits/,
/(?:missing|required|invalid|not configured|not set|unset)[^.]{0,80}api[_ -]?key/,
/api[_ -]?key[^.]{0,80}(?:missing|required|invalid|not configured|not set|unset)/,
/tinyfish_api_key/,
/openrouter_api_key/,
/quota exceeded/,
/rate[_ -]?limit[_ -]?exceeded/,
/benchmark deadline/,
];
return blockerPatterns.some((pattern) => pattern.test(combinedText))
? "Infrastructure/auth/credits blocker."
: null;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Infra blocker detection can false-positive on benign payload text.

Bare patterns for tinyfish_api_key / openrouter_api_key mark runs as blocked even when those strings appear in normal docs/output text.

💡 Proposed fix
   const blockerPatterns = [
@@
-    /tinyfish_api_key/,
-    /openrouter_api_key/,
+    /(?:missing|required|invalid|not configured|not set|unset)[^.]{0,80}(?:tinyfish_api_key|openrouter_api_key)/,
+    /(?:tinyfish_api_key|openrouter_api_key)[^.]{0,80}(?:missing|required|invalid|not configured|not set|unset)/,
@@
   ];
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/dataset-agent/run-benchmark.mjs` around lines 1260 - 1275, The
current bare patterns in blockerPatterns (especially "tinyfish_api_key" and
"openrouter_api_key") cause false positives when those strings appear in normal
text; update those two entries to be more specific by requiring word boundaries
and an assignment/contextual indicator (e.g., require \b around the key name and
expect an assignment or value within a short window like =, :, or quotes within
~0–80 chars). Edit the blockerPatterns array (the tinyfish_api_key and
openrouter_api_key entries) to replace the bare literals with regexes such as
requiring \b(?:tinyfish_api_key|openrouter_api_key)\b followed by a nearby
assignment/value pattern so only actual credential-like occurrences trigger the
blocker.

Comment on lines +1442 to +1447
function rowHasAllowedDomain(row, allowedDomains) {
if (!allowedDomains?.length) return true;
const cells = rowCells(row);
return rowSourceUrls(row, cells).some((url) =>
allowedDomains.some((allowedDomain) => urlHostname(url).endsWith(allowedDomain))
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Domain validation accepts lookalike hostnames.

Using endsWith(allowedDomain) allows false positives like notopenai.com matching openai.com.

💡 Proposed fix
 function rowHasAllowedDomain(row, allowedDomains) {
   if (!allowedDomains?.length) return true;
   const cells = rowCells(row);
   return rowSourceUrls(row, cells).some((url) =>
-    allowedDomains.some((allowedDomain) => urlHostname(url).endsWith(allowedDomain))
+    allowedDomains.some((allowedDomain) => {
+      const host = urlHostname(url);
+      return host === allowedDomain || host.endsWith(`.${allowedDomain}`);
+    })
   );
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/dataset-agent/run-benchmark.mjs` around lines 1442 - 1447,
rowHasAllowedDomain currently treats hostnames as allowed if
urlHostname(url).endsWith(allowedDomain), which allows lookalike matches (e.g.
notopenai.com). Fix by normalizing the hostname via urlHostname(url) and
matching either exact equality (hostname === allowedDomain) or a subdomain
boundary (hostname.endsWith('.' + allowedDomain)); update the logic inside
rowHasAllowedDomain that iterates rowSourceUrls and uses urlHostname so it only
returns true for exact domain or proper subdomain matches.

@MMeteorL
Copy link
Copy Markdown
Collaborator Author

The benchmark testing is just a pipelined replication of the agent behavior, instead of going through the same workflow. Thus using it as a reference to agent experimentation instead of precise agent performance is a better use for the testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant