Add opt-in playwright_execute tool to the CUA agent and CLI #33

Open
dprevoznik wants to merge 10 commits into main from hypeship/cua-playwright-execute-tool

Conversation

@dprevoznik dprevoznik commented Jun 19, 2026

Summary

Adds an opt-in playwright_execute tool so the model can run Playwright/TypeScript directly against the live browser session — for steps that are awkward as raw pointer/keyboard actions (precise DOM reads, form fills, data extraction, waiting on selectors). It sits alongside the existing computer-use tools rather than replacing them.

Execution runs server-side in the browser VM via the Kernel SDK (client.browsers.playwright.execute), which exposes page/context/browser and lets the code return a JSON-serializable value. No CDP wiring or local Playwright is needed.
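A minimal sketch of the call shape described above, for readers who have not used the endpoint. The interfaces are local stand-ins rather than the SDK's real types: the field names (`success`, `result`, `stdout`, `stderr`, `error`), the session-id argument, and the helper names `buildExecuteParams` / `runInBrowser` are assumptions for illustration only.

```typescript
// Local stand-ins for the SDK types (assumed shapes, not copied from the SDK).
interface PlaywrightExecuteParams {
  code: string;
  timeout_sec?: number;
}

interface PlaywrightExecuteResult {
  success: boolean;
  result?: unknown;
  stdout?: string;
  stderr?: string;
  error?: string;
}

interface PlaywrightClient {
  browsers: {
    playwright: {
      execute(
        sessionId: string,
        params: PlaywrightExecuteParams,
      ): Promise<PlaywrightExecuteResult>;
    };
  };
}

// The kind of snippet a model might send: it uses the injected `page`
// and returns a JSON-serializable value.
const exampleCode = `
  const title = await page.title();
  const h1 = await page.locator('h1').first().textContent();
  return { title, h1 };
`;

// Hypothetical helper: omit timeout_sec entirely when the caller gave none,
// so the server-side default applies.
function buildExecuteParams(code: string, timeoutSec?: number): PlaywrightExecuteParams {
  return timeoutSec === undefined ? { code } : { code, timeout_sec: timeoutSec };
}

// Hypothetical helper: forwards the snippet to the remote executor.
async function runInBrowser(
  client: PlaywrightClient,
  sessionId: string,
  code: string,
  timeoutSec?: number,
): Promise<PlaywrightExecuteResult> {
  return client.browsers.playwright.execute(sessionId, buildExecuteParams(code, timeoutSec));
}
```

Because the client is passed in as an interface, the same helper can be exercised in tests with a fake that returns a canned result.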

It is modeled directly on the existing computer_use_extra navigation tool:

  • @onkernel/cua-ai: playwright_execute tool name, { code, timeout_sec? } schema, and createCuaPlaywrightToolDefinition().
  • @onkernel/cua-agent: InternalComputerTranslator.executePlaywright(), a playwright executor in tools.ts, and the playwright option threaded through CuaAgent/CuaAgentHarness. The tool name is added to keepToolNames() so provider payload hooks don't strip it.
  • @onkernel/cua-cli: --playwright flag and a TUI tool-call preview.
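For orientation, a generic function-tool definition with the `{ code, timeout_sec? }` shape might look roughly like the sketch below. This is not the output of createCuaPlaywrightToolDefinition(); the `FunctionToolDefinition` type, the `createPlaywrightToolSketch` name, and the description wording are hypothetical.

```typescript
// Hypothetical, simplified function-tool shape (JSON-Schema-style parameters).
interface FunctionToolDefinition {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
    additionalProperties: boolean;
  };
}

// Wire name taken from the PR; everything else here is illustrative.
const PLAYWRIGHT_TOOL_NAME = "playwright_execute";

function createPlaywrightToolSketch(): FunctionToolDefinition {
  return {
    name: PLAYWRIGHT_TOOL_NAME,
    description:
      "Run Playwright/TypeScript against the live browser session. " +
      "Locals do not persist across calls, but the browser session does.",
    parameters: {
      type: "object",
      properties: {
        code: {
          type: "string",
          description: "Code with page/context/browser in scope; may return a JSON-serializable value.",
        },
        timeout_sec: {
          type: "number",
          description: "Execution timeout in seconds (default 60, max 300).",
        },
      },
      // Only `code` is required; `timeout_sec` is optional.
      required: ["code"],
      additionalProperties: false,
    },
  };
}
```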

Behavior, per the decisions on this:

  • Opt-in: off by default; enable with --playwright (CLI) or playwright: true (library).
  • Result shape: returns result (when present), stdout/stderr only when non-empty, and error when success is false. A failure reported by the SDK comes back as tool content (not thrown) so the model can adapt; only a thrown SDK error surfaces as a tool error. Library consumers can also read the structured result/stdout/stderr/error off PlaywrightDetails without re-parsing the tool content text.
  • Timeout: timeout_sec follows the documented server contract (default 60s, max 300s); values are clamped client-side so the model can't violate the cap.
  • Screenshot: appends a fresh screenshot after execution so the screenshot loop stays coherent.
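The result-shape rule above can be sketched as a small formatter. This is an illustrative reconstruction, not the PR's formatPlaywrightResult: the `ExecutionResult` type, the `formatExecution` name, the block labels, and the "(no output)" fallback are all assumptions.

```typescript
// Assumed shape of what the executor gets back from the SDK.
interface ExecutionResult {
  success: boolean;
  result?: unknown;
  stdout?: string;
  stderr?: string;
  error?: string;
}

// Builds the model-facing text: result when present, stdout/stderr only
// when non-empty, and an error block when success is false.
function formatExecution(execution: ExecutionResult): string {
  const parts: string[] = [];
  if (execution.result !== undefined) {
    parts.push(`result: ${JSON.stringify(execution.result)}`);
  }
  if (execution.stdout) parts.push(`stdout:\n${execution.stdout}`);
  if (execution.stderr) parts.push(`stderr:\n${execution.stderr}`);
  if (!execution.success) {
    parts.push(`error: ${execution.error ?? "unknown error"}`);
  }
  // Hypothetical fallback so the tool content is never empty.
  return parts.length > 0 ? parts.join("\n\n") : "(no output)";
}
```

A failed run goes through the same path and simply gains an error block, which is what lets the model read the failure and retry instead of seeing a thrown tool error.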

Naming note

The model-facing wire name is playwright_execute (snake_case, consistent with computer_use_extra / computer_batch), the CLI flag is --playwright, and the option is playwright.

Model support

The tool is advertised as a generic function tool, so any provider that supports function calling alongside its native computer-use API can call it. The playwright_execute name is added to keepToolNames() so provider payload hooks that filter unknown tools (tzafon/yutori) won't strip it. Verified e2e against:
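To illustrate why the name has to be in keepToolNames(), here is a hypothetical payload hook that drops tools a provider does not natively recognize unless they are on a keep-list. The real hook and the real contents of keepToolNames() are not shown in this PR description; `ToolSpec`, `KEEP_TOOL_NAMES`, and `filterToolsForProvider` are invented for the sketch.

```typescript
// Minimal tool descriptor for the sketch.
interface ToolSpec {
  name: string;
}

// Illustrative keep-list; the actual list returned by keepToolNames() may differ.
const KEEP_TOOL_NAMES: ReadonlySet<string> = new Set([
  "computer_use_extra",
  "playwright_execute",
]);

// Keeps a tool if the provider knows it natively or if it is on the keep-list.
function filterToolsForProvider<T extends ToolSpec>(
  tools: T[],
  nativeNames: ReadonlySet<string>,
): T[] {
  return tools.filter(
    (tool) => nativeNames.has(tool.name) || KEEP_TOOL_NAMES.has(tool.name),
  );
}
```

Without the keep-list entry, a filtering hook like this would silently remove playwright_execute from the request and the model would never see the tool.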

  • Anthropic (claude-opus-4-7)
  • Tzafon (tzafon.northstar-cua-fast-1.6)
  • Yutori (n1.5-latest)

OpenAI (gpt-5.5) and Google (gemini-3-flash-preview) are unit-tested but not yet e2e-verified against a live browser.

Docs

packages/agent/README.md, packages/ai/README.md, and packages/cli/README.md updated alongside the code.

Test plan

  • npm run typecheck (workspace) passes
  • @onkernel/cua-agent suite passes, incl. 3 new tests (tool synthesized when enabled; execution formats result/stdout + appends screenshot; failure surfaces as content without throwing)
  • @onkernel/cua-ai (88) and @onkernel/cua-cli (37) suites pass
  • Manual smoke against a live Kernel browser (cua --playwright) on three providers:
    • Anthropic (claude-opus-4-7) — happy path returned result: {"h1":"Example Domain","title":"Example Domain"} in one turn; details carried the structured result object.
    • Tzafon (tzafon.northstar-cua-fast-1.6) — same one-turn happy path. Confirms keepToolNames() correctly preserves the tool through tzafon's payload hook.
    • Yutori (n1.5-latest) — recovered from a TypeError (page.querySelector is not a function) and a ReferenceError (document not defined) by reading the failure-as-content stderr/error blocks, then arrived at the correct page.evaluate(...) pattern. Confirms the failure-as-content design closes the iteration loop.
  • Failure path verified during the Yutori smoke: success: false with the Playwright stderr/error came back as tool content (not thrown), screenshot still appended, model read it and adapted.

🤖 Generated with Claude Code


Note

Medium Risk
New remote code execution path against the live browser session when opt-in is enabled; failures are mostly surfaced as tool content but SDK errors can still throw.

Overview
This PR adds an opt-in Playwright escape hatch alongside existing computer-use tools. When enabled (playwright: true on CuaAgent / CuaAgentHarness, or cua --playwright), the model gets a playwright_execute tool that runs Playwright/TypeScript in the live Kernel browser via client.browsers.playwright.execute.

@onkernel/cua-ai defines the tool contract: CuaPlaywrightSchema (code, optional timeout_sec), CUA_PLAYWRIGHT_TOOL_NAME, and createCuaPlaywrightToolDefinition().

@onkernel/cua-agent wires execution: InternalComputerTranslator.executePlaywright() (timeout clamped to 300s), a playwright executor in tools.ts that returns structured PlaywrightDetails and model-facing content (result, stdout/stderr, errors without throwing on SDK-reported failure), and a post-call screenshot. playwright_execute is included in keepToolNames() so provider payload hooks do not strip it. Extra-tool assembly is generalized from navigation-only to withExtraTools.

@onkernel/cua-cli passes --playwright through harness setup and shows truncated code in the TUI tool-call preview. Docs and tests cover synthesis, happy path, and failure-as-content behavior. CLI package version bumps to 0.1.1 in the lockfile.
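A truncated preview of the kind described for the TUI could be as simple as the sketch below. The `previewCode` name, the 80-character default, and the whitespace collapsing are assumptions; the PR does not state the actual limits.

```typescript
// Collapses whitespace to one line and truncates with an ellipsis so a long
// snippet fits in a single tool-call preview row.
function previewCode(code: string, maxChars = 80): string {
  const singleLine = code.replace(/\s+/g, " ").trim();
  if (singleLine.length <= maxChars) return singleLine;
  // Reserve one character for the ellipsis so the result is exactly maxChars long.
  return singleLine.slice(0, maxChars - 1) + "…";
}
```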

Reviewed by Cursor Bugbot for commit d746e8c. Bugbot is set up for automated code reviews on this repo. Configure here.

dprevoznik and others added 8 commits June 19, 2026 21:51
Exposes a tool that runs Playwright/TypeScript directly against the
browser session (via the Kernel SDK browsers.playwright.execute) for
steps that are awkward as raw pointer/keyboard actions. Modeled on the
existing computer_use_extra navigation tool: defined in cua-ai, executed
through the translator, gated by a `playwright` option, and added to
keepToolNames so providers retain it in the payload. Enable with the
`--playwright` CLI flag. Returns result/stdout/stderr and appends a fresh
screenshot so the screenshot loop stays coherent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Drop misleading "Defaults to 60" from timeout_sec description; the
  actual default lives in the Kernel SDK, not here.
- Expose result/stdout/stderr/error on PlaywrightDetails so library
  consumers can branch on the structured execution result without
  re-parsing tool content text.
- Guard formatPlaywrightResult against non-JSON-serializable returns
  (e.g. BigInt, circular refs) so a successful Playwright run never
  becomes a tool-level error.
- Sync package-lock.json to match the cua-cli 0.1.1 bump in a7cdc07.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Locals don't persist across calls but the browser session does. Without
this, a model could write code in call N assuming variables from call
N-1 are still in scope.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Earlier review feedback dropped "Defaults to 60" out of a worry that the
default lived in the SDK and could drift. The kernel.sh docs put both
the default (60s) and the cap (300s) on the server, so the description
is the authoritative place to surface them — the model can't choose a
sensible timeout without that anchor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Schema description tells the model "max 300" but nothing enforced it.
A model that ignored the bound would have hit a confusing SDK-level
failure depending on server behavior; this clamp keeps the client
honest to the documented contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- packages/agent: list playwright option alongside computerUseExtra and
  add a paragraph explaining the tool's behavior and tested-models scope.
- packages/ai: list the new tool-definition factory, schema, constants,
  and CuaPlaywrightInput type in the API surface index.
- packages/cli: document --playwright with a short explainer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

formatPlaywrightResult's JSON.stringify try/catch guarded against
non-serializable values, but execution.result came from the SDK after
a JSON round trip through the wire — anything that survived that is
already JSON-safe, so the catch arm is unreachable.

The executePlaywright timeout chain checked typeof === "number" (dead,
the parameter is TS-typed number | undefined) and Number.isFinite
(redundant — timeoutSec > 0 already rejects NaN, and Math.min handles
Infinity).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Empirical results show CUA-specialized providers (Tzafon, Yutori) do
emit playwright_execute calls — earlier docs were overly cautious.
Yutori in particular demonstrates the failure-as-content design well:
it iterated through two wrong-API attempts (page.querySelector, bare
document) before reading the stderr/error blocks and landing on
page.evaluate(), which throwing would have prevented.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@dprevoznik dprevoznik marked this pull request as ready for review June 20, 2026 19:29
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: PR is in the kernel repo but affects the CUA (computer-use agent) service; unclear if this qualifies as a monitored service within the kernel mono repo—please confirm or add the kernel:cua label to opt in.

To monitor this PR anyway, reply with @firetiger monitor this.

Comment thread packages/agent/src/tools.ts Outdated
Comment thread packages/agent/src/tools.ts
Matches executeBatchTool's shape: the trailing translator.screenshot()
lives inside the same try/catch as the underlying work, so any failure
in the pipeline produces a single wrapped tool error rather than
diverging based on which step failed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
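
A rough sketch of the shape this commit describes, with the execution and the trailing screenshot inside one try/catch. The `TranslatorLike` interface, the `ToolContent` type, the `runPlaywrightTool` name, and the error-message prefix are hypothetical; only the control flow is the point.

```typescript
// Assumed shape of the execution result returned by the translator.
interface ExecutionResult {
  success: boolean;
  result?: unknown;
  error?: string;
}

// Hypothetical slice of the translator used by the executor.
interface TranslatorLike {
  executePlaywright(code: string, timeoutSec?: number): Promise<ExecutionResult>;
  screenshot(): Promise<string>; // assumed to resolve to base64 image data
}

type ToolContent =
  | { type: "text"; text: string }
  | { type: "image"; data: string };

async function runPlaywrightTool(
  translator: TranslatorLike,
  code: string,
  timeoutSec?: number,
): Promise<ToolContent[]> {
  try {
    const execution = await translator.executePlaywright(code, timeoutSec);
    // A reported failure (success: false) is returned as content, not thrown.
    const text = execution.success
      ? `result: ${JSON.stringify(execution.result ?? null)}`
      : `error: ${execution.error ?? "unknown error"}`;
    // The screenshot sits in the same try block as the execution.
    const image = await translator.screenshot();
    return [
      { type: "text", text },
      { type: "image", data: image },
    ];
  } catch (err) {
    // Whichever step threw, the caller sees one uniformly wrapped tool error.
    const message = err instanceof Error ? err.message : String(err);
    throw new Error(`playwright_execute failed: ${message}`);
  }
}
```

With this layout a screenshot failure and an SDK failure produce the same kind of wrapped error, so callers do not need to distinguish which step broke.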

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using high effort and found 4 potential issues.

Reviewed by Cursor Bugbot for commit 565fe01. Configure here.

Comment thread packages/cli/src/tui/message-list.ts
Comment thread packages/agent/src/translator/translator.ts
Comment thread packages/agent/src/tools.ts
Comment thread packages/agent/src/tools.ts
- executePlaywright: timeout_sec values below 1s previously truncated
  to 0 and were forwarded to the SDK, which differs from omitting the
  field. Floor the truncated value at 1s; anything sub-second falls
  back to "use server default".
- Document PlaywrightDetails fields so library consumers know what
  each one means without reading the executor source.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
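
Taken together, the timeout commits suggest a normalization along the lines of the sketch below. It is one reading of the commit messages, not the code from translator.ts: `normalizeTimeoutSec` and `MAX_TIMEOUT_SEC` are illustrative names, and the sketch assumes that "falls back to server default" means the field is omitted (returned as undefined).

```typescript
// Documented server-side cap, in seconds.
const MAX_TIMEOUT_SEC = 300;

// Returns the timeout to forward, or undefined to omit the field and let
// the server default (60s per the docs cited in the PR) apply.
function normalizeTimeoutSec(timeoutSec?: number): number | undefined {
  // `!(x > 0)` rejects undefined-like input, zero, negatives, and NaN in one check.
  if (timeoutSec === undefined || !(timeoutSec > 0)) return undefined;
  const whole = Math.trunc(timeoutSec);
  // Sub-second values truncate to 0; never forward 0, use the server default instead.
  if (whole < 1) return undefined;
  // Math.min also handles Infinity, which clamps to the cap.
  return Math.min(whole, MAX_TIMEOUT_SEC);
}
```

For example, 45.9 is forwarded as 45, 1000 is clamped to 300, and 0.5 is dropped so the server default applies.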