Problem
The playwright-cli skill description is the only signal an LLM reads before deciding whether to invoke the skill. The current description:
Automate browser interactions, test web pages and work with Playwright tests.
causes the model to miss roughly 20% of valid invocations when it cannot rely on prior knowledge of the skill name.
Benchmark evidence
Tested 5 description variants against a 50-prompt set (40 should-trigger / 10 should-not-trigger) using OpenAI Codex as the routing judge. To eliminate the prior-knowledge confound, the skill was renamed to the fictional name headless-pilot — forcing Codex to route purely from description text, not from training data about playwright.
| Variant | Recall | Precision | F1 |
|---|---|---|---|
| baseline (current) | 0.800 | 1.000 | 0.889 |
| v1 explicit triggers | 0.950 | 1.000 | 0.974 |
| v2 intent-first | 0.925 | 1.000 | 0.961 |
| v3 verb-dense | 0.950 | 1.000 | 0.974 |
| v4 "Use when" style | 0.975 | 1.000 | 0.987 |
All variants had perfect precision (no false positives). The gap is entirely in recall.
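As a quick arithmetic check, the sketch below recomputes each row's recall, precision, and F1 from confusion counts. The true-positive counts are an assumption: they are inferred from the reported recall on the 40 should-trigger prompts (for example, 0.800 × 40 = 32) and from zero false positives, not taken from raw benchmark output.

```python
# True-positive counts inferred from reported recall x 40 should-trigger prompts.
# This is an assumption; the raw per-prompt results are not reproduced here.
SHOULD_TRIGGER = 40
FALSE_POSITIVES = 0  # every variant had perfect precision

true_positives = {
    "baseline (current)": 32,
    "v1 explicit triggers": 38,
    "v2 intent-first": 37,
    "v3 verb-dense": 38,
    'v4 "Use when" style': 39,
}


def scores(tp: int, fp: int, positives: int) -> tuple[float, float, float]:
    """Return (recall, precision, F1) from confusion counts."""
    recall = tp / positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


results = {
    name: scores(tp, FALSE_POSITIVES, SHOULD_TRIGGER)
    for name, tp in true_positives.items()
}

for name, (recall, precision, f1) in results.items():
    print(f"{name:22s} recall={recall:.3f} precision={precision:.3f} F1={f1:.3f}")
```

With precision fixed at 1.000, F1 reduces to 2R / (1 + R), so the ranking by F1 is the same as the ranking by recall.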
Prompts the baseline misses
The baseline misses 8 of the 40 should-trigger prompts. Each has an obvious playwright-cli solution but doesn't contain the words "automate" or "browser interactions":
- run tests/checkout.spec.ts in headed mode
- debug why tests/auth.spec.ts line 42 is broken
- record a demo video of the checkout flow
- export the pricing page as a PDF
- record a video with chapter titles for the onboarding demo
- mock the /api/users endpoint to return an empty array
- intercept network requests and log which ones are slow
- (one additional borderline case)
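To make the keyword gap concrete, here is a small check over the seven prompts spelled out above (the eighth, borderline case is not listed, so it is omitted). It only confirms that none of them contains "automate" or "browser interactions". It is an illustration of the vocabulary mismatch, not a model of how the routing judge decides.

```python
# The seven missed prompts quoted above; the eighth (borderline) case is not listed.
missed_prompts = [
    "run tests/checkout.spec.ts in headed mode",
    "debug why tests/auth.spec.ts line 42 is broken",
    "record a demo video of the checkout flow",
    "export the pricing page as a PDF",
    "record a video with chapter titles for the onboarding demo",
    "mock the /api/users endpoint to return an empty array",
    "intercept network requests and log which ones are slow",
]

# Key terms from the baseline description that the missed prompts lack.
BASELINE_KEYWORDS = ("automate", "browser interactions")


def keyword_hits(prompt: str) -> list[str]:
    """Return the baseline keywords that appear in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [kw for kw in BASELINE_KEYWORDS if kw in lowered]


overlap = {prompt: keyword_hits(prompt) for prompt in missed_prompts}

for prompt, hits in overlap.items():
    print(f"{prompt!r}: {hits or 'no keyword overlap'}")
```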
Proposed fix
PR #409 changes only the description field in skills/playwright-cli/SKILL.md to the winning v4 style:
description: >
  Use when the user says: "go to this URL", "click this button", "fill this
  form", "take a screenshot", "scrape this page", "log in to X", "run my
  Playwright tests", "this test is failing", "write a test for", "mock this
  API", "record a demo". The skill opens a live browser and drives it.
This quotes the natural-language trigger phrases users actually type, giving the routing model a direct pattern-match signal rather than requiring it to infer that "record a demo" ≡ "automate browser interactions".