A computer-use CLI for agents (and TUI for humans) built on pi-agent.
cua "go to news.ycombinator.com and tell me the top 3 story titles"cua provisions a Kernel cloud browser, turns the model's computer-use tool calls into real mouse/keyboard/scroll/screenshot actions, and streams the result back to your terminal.
Every frontier model now ships its own first-party "computer use" tool:
- OpenAI gpt-5.5: a built-in
computertool that emits actions like{type:"click", x, y},{type:"scroll", x, y, scroll_x, scroll_y},{type:"keypress", keys:[...]}, … - Anthropic claude-opus-4-7: a built-in
computer_20251124tool that emits{action:"left_click", coordinate:[x, y]},{action:"scroll", scroll_direction, scroll_amount}, … - Google gemini-2.5-pro / gemini-3.x: a set of predefined
computer-use functions (
click_at,type_text_at,scroll_at,navigate,go_back, …) with 0-1000 normalized coordinates. - Yutori Navigator n1 / n1.5: OpenAI-compatible
chat.completionsresponses with built-in browser actiontool_callslikeleft_click,goto_url,type, andscrollin 0-1000 normalized coordinates.
All of them expect you to:
- Run a real browser somewhere (locally is annoying, on a server is hard).
- Translate every action into an actual SDK call against that browser.
- Capture a fresh screenshot after each action and feed it back to the model so it can verify what happened and plan the next step.
- Keep doing this in a loop until the task is done.
cua does all of this for you. The repo is structured as several focused npm packages so the per-provider plumbing is also reusable outside of this binary (e.g. by agents of your own spun up via kernel/cli templates).
packages/
├── ai/ # @onkernel/cua-ai - CUA model catalog + tool schemas + provider adapters (on npm)
├── agent/ # @onkernel/cua-agent - CuaAgent/CuaAgentHarness Kernel-browser execution loop (on npm)
└── cli/ # @onkernel/cua-cli - the `cua` binary; built on cua-agent + cua-ai
Building your own agent? Start here: packages/agent
(@onkernel/cua-agent) — CuaAgent/CuaAgentHarness run the full
computer-use loop against a Kernel browser. It sits on
packages/ai (@onkernel/cua-ai), the model layer with the
curated computer-use model catalog, canonical tool schemas, and per-provider
adapters on top of pi-ai; reach for cua-ai directly only when you bring your
own execution. Both are published to npm.
flowchart LR
ai[("@onkernel/cua-ai")]
agent[("@onkernel/cua-agent")]
cli[("@onkernel/cua-cli")]
pi[("pi-agent-core / pi-ai / pi-tui / pi-coding-agent")]
sdk[("@onkernel/sdk")]
ai --> agent
agent --> cli
ai --> cli
pi --> agent
pi --> cli
sdk --> agent
sdk --> cli
| Package | What it ships |
|---|---|
@onkernel/cua-ai |
Computer-use model catalog (getCuaModel/listCuaModels), canonical CUA tool schemas, and provider adapters/runtime specs built on pi-ai. On npm. |
@onkernel/cua-agent |
CuaAgent/CuaAgentHarness classes that execute cua-ai tool calls against a Kernel browser, screenshot loop included. On npm. |
@onkernel/cua-cli |
The cua binary: argv parsing, sessions, skills, JSONL output, pi-tui front-end. |
git clone https://github.com/kernel/cua
cd cua
npm install
# run the CLI directly from source (no global install required):
npx tsx packages/cli/src/cli.ts --help
# if you want `cua` on $PATH from any directory, add a shell function to
# your rc that pins the repo location while preserving the caller's cwd
# (so `--out`, transcript bucketing, and `.agents/skills` discovery use
# the directory you invoked from), e.g. in ~/.bashrc:
# CUA_REPO=/absolute/path/to/cua
# cua() { "$CUA_REPO/node_modules/.bin/tsx" "$CUA_REPO/packages/cli/src/cli.ts" "$@"; }
# set API keys via env vars
export OPENAI_API_KEY=sk-... # for gpt-5.5
export ANTHROPIC_API_KEY=sk-ant-... # for claude-opus-4-7
export GOOGLE_API_KEY=... # for gemini-3-flash-preview
export YUTORI_API_KEY=yt_... # for n1.5-latest
export KERNEL_API_KEY=sk_... # always required
# single-shot
cua -p "Open https://news.ycombinator.com and tell me the top story"
# list supported model ids
cua models
# Claude
cua -p --model claude-opus-4-7 "Same prompt"
# Gemini 3 Flash (built-in computer use)
cua -p --model gemini-3-flash-preview "Same prompt"
# Yutori Navigator
cua -p --model n1.5-latest "Same prompt"
# interactive TUI (default mode)
cua
cua "summarize https://news.ycombinator.com"
# agent-friendly subcommands (one-shot, see "Agent-friendly subcommands" below
# for the full surface and named-session workflow)
cua open https://github.com/login
cua click "Sign in"
cua url
cua screenshot --out shot.png
# resume the most recent session for this cwd (fresh browser, prior context)
cua -c "now click on the second result"
# JSONL events for scripting
cua -p -o jsonl "open example.com and tell me the heading"- Model layer —
@onkernel/cua-aiowns the curated computer-use model catalog (getCuaModel/listCuaModels), the canonical CUA tool-call schemas, and per-provider adapters on top ofpi-aiso every model emits the sameCuaActionvocabulary. - Execution layer —
@onkernel/cua-agentwrapspi-agent-core'sAgent/AgentHarness.CuaAgentHarnessruns the prompt/screenshot/tool loop against a Kernel browser: it dispatches canonical CUA tool calls into Kernel SDKbrowsers.computer.*calls and captures a fresh screenshot back to the model on every turn. - CLI —
@onkernel/cua-cliassembles aCuaAgentHarnessfrom command-line flags, env-var-based API keys, aJsonlSessionRepofor transcripts, and pi skills; renders the result either as plain text (--print), JSONL events (-o jsonl), or an interactive pi-tui front-end. - Browser — a fresh Kernel cloud browser session per run (or per resume) with optional named profile load/save. Every screenshot the model sees is a real PNG of a real browser tab.
See docs/architecture.md for the full
end-to-end flow.
See packages/cli/README.md for the
full CLI reference, env-var configuration, and model selection.
Highlights:
-p/--printfor single-shot mode;-o jsonlfor structured output.cua modelsto list supported-m/--modelvalues and their providers.-m/--model <id>to choose one of those supported models.-s/--session-name <name>to reuse acua session start-allocated Kernel browser across calls.-c/--continue,-r/--resume,--session <ref>for transcript resume.--skill <path>//skill:<name>for Agent Skills (defaults:~/.agents/skills/,<cwd>/.agents/skills/).--image-protocol/CUA_IMAGE_PROTOCOLto force inline screenshot rendering (kitty/iterm2/none/auto; Ghostty / WezTerm are auto-detected askitty).
Each subcommand below is one-shot: it provisions a Kernel browser, runs the action, prints a compact result on stdout, and exits with a deterministic code. Designed for shell agents to chain.
| Subcommand | Result on stdout | Exit codes |
|---|---|---|
cua open <url> |
ok |
0 ok, 2 error |
cua click "<description>" |
ok clicked (x, y) or not_found <reason> |
0, 1 not_found, 2 error |
cua type "<target>" "<text>" |
ok typed or not_found <reason> |
0, 1 not_found, 2 error |
cua press <key> [<key>...] |
ok pressed |
0 ok, 2 error |
cua observe ["<question>"] |
the description / answer | 0 ok, 2 error |
cua url |
the current URL | 0 ok, 2 error |
cua screenshot --out <file|-> |
the path (or (stdout) when --out -) |
0 ok, 2 error |
cua do "<instruction>" |
the assistant's final text | 0 ok, 2 error |
By default each call provisions a fresh browser, so the second call can't see anything the first call did. For multi-step workflows, use a named session.
cua session start login # provisions a Kernel browser, prints `name=login`
cua -s login open https://github.com/login
cua -s login type "email field" "$EMAIL"
cua -s login type "password field" "$PASSWORD"
cua -s login click "Sign in"
cua -s login url # stdout: the post-login URL
cua session stop login # tears down the Kernel browserInspect:
cua session list # tab-formatted: NAME, KERNEL_ID, AGE, LIVE_URL
cua session show login # full JSON: kernel_session_id, live_url, transcript_path, ...-s <name> works for ALL modes (action subcommands, --print, the
interactive TUI). Liveness is checked before each attach: if the Kernel
browser timed out, the call fails with a clear "session no longer
alive" error suggesting cua session stop <name> && cua session start <name>.
Named-session metadata lives in $XDG_DATA_HOME/cua/named-sessions/<name>.json
(default ~/.local/share/cua/named-sessions/).
Every --print, interactive TUI, and -s <name> invocation persists a
JSONL transcript by default — useful for analyzing or self-improving
agent behavior.
Where: $XDG_DATA_HOME/cua/sessions/<cwd-hash>/<id>.jsonl (default
~/.local/share/cua/sessions/). For named sessions, the exact path is
in the transcript_path field of cua session show <name>.
Format: one JSON object per line. Roles: user, assistant,
toolResult (from pi-coding-agent's SessionManager). There's also a
custom cua-browser entry written once per session with
kernel_session_id / live_url / profile_id.
Opting out: --no-session keeps the run in-memory only. One-shot
action subcommands (without -s) also skip the transcript, since
they're already self-contained.
Analyzing: anything that reads JSONL works. A few jq starters:
TRANSCRIPT=~/.local/share/cua/sessions/<cwd>/<id>.jsonl
# Every tool call the agent made, in order
jq -c 'select(.role == "assistant") | .content[]?
| select(.type == "tool_use") | {name, input}' \
"$TRANSCRIPT"
# Largest tool-result screenshot (handy when chasing context-window blowups)
jq -c 'select(.role == "toolResult") | .content[]?
| select(.type == "image") | {len: (.data | length)}' \
"$TRANSCRIPT" | sort -t: -k2 -n | tail -1
# Final assistant text (the answer)
jq -r 'select(.role == "assistant") | .content[]?
| select(.type == "text") | .text' "$TRANSCRIPT" | tail -1--print -o jsonl is a separate live-event stream (one event per line
on stdout, different schema). Both are useful for analysis but they're
NOT the same thing: the -o jsonl stream describes turns / tool calls
/ deltas as they happen; the transcript JSONL is the persisted message
history pi-coding-agent's SessionManager writes.
cua follows the cross-agent ~/.agents/skills/
emerging standard. Skills loaded from any of these locations are
auto-discovered (first wins on name collision):
- Explicit
--skill <path>flags (file or directory; repeatable). ~/.agents/skills/(user-global).<cwd>/.agents/skills/(project-local).
Each skill's name, description, and file location are added to
the system prompt. The model is instructed to use the read tool to
load a skill's full body when its description matches the task — only
descriptions and locations live in the prompt by default, so the prompt
stays small no matter how many skills you have.
To force-load a skill body inline on a single turn, prefix the prompt
with /skill:<name> (works in both --print and the interactive TUI):
cua -p "/skill:my-workflow open https://..."Disable discovery entirely with --no-skills / -ns.
This repo ships a skills/cua-cli/SKILL.md aimed at OTHER agents
(Claude Code, Cursor, pi-coding-agent, etc.) that want to drive cua
as a CLI subcommand. To install it globally:
mkdir -p ~/.agents/skills
ln -s "$(pwd)/skills/cua-cli" ~/.agents/skills/cua-cliskills/
└── cua-cli/SKILL.md # skill aimed at OTHER agents driving cua via shell
packages/
├── ai/ # @onkernel/cua-ai — model layer (see packages/ai/README.md)
├── agent/ # @onkernel/cua-agent — Kernel-browser execution layer (see packages/agent/README.md)
└── cli/ # @onkernel/cua-cli — the `cua` binary (see packages/cli/README.md)
- Auto-respawn dead Kernel sessions when
-s <name>is used (today we refuse with a clear error and ask the user to re-session start). --localDocker-backed browser as an alternative to Kernel cloud.- Anthropic
hold_key/zoomaction support. - pi-tui SelectList-based picker for
-rinstead of plain readline.
MIT