skill: description causes ~20% recall gap in LLM routing (baseline recall=0.800) #410

@pedropaulovc

Problem

The playwright-cli skill description is the only signal an LLM reads before deciding whether to invoke the skill. The current description:

Automate browser interactions, test web pages and work with Playwright tests.

causes the model to miss 20% of valid invocations (8 of the 40 should-trigger prompts in the benchmark below) when it cannot rely on prior knowledge of the skill name.

Benchmark evidence

Five description variants (the current baseline plus four rewrites) were tested against a 50-prompt set (40 should-trigger / 10 should-not-trigger), using OpenAI Codex as the routing judge. To eliminate the prior-knowledge confound, the skill was renamed to the fictional name headless-pilot, which forces Codex to route purely from the description text rather than from training data about Playwright.
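For readers who want to reproduce the setup, the scoring loop can be sketched roughly as follows. This is a minimal illustration, not the harness used for the numbers in this issue: `LabeledPrompt`, `Confusion`, `evaluate`, and `keyword_judge` are hypothetical names, and `keyword_judge` is a toy word-overlap stand-in for the real LLM judge (Codex), which would be called in its place.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LabeledPrompt:
    """One benchmark prompt plus its expected routing decision."""
    text: str
    should_trigger: bool


@dataclass
class Confusion:
    """Confusion-matrix counts for one description variant."""
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0


def evaluate(
    judge: Callable[[str, str], bool],
    description: str,
    prompts: Iterable[LabeledPrompt],
) -> Confusion:
    """Ask the judge whether each prompt routes to the skill and tally the results."""
    counts = Confusion()
    for prompt in prompts:
        routed = judge(description, prompt.text)
        if routed and prompt.should_trigger:
            counts.tp += 1
        elif routed and not prompt.should_trigger:
            counts.fp += 1
        elif not routed and prompt.should_trigger:
            counts.fn += 1
        else:
            counts.tn += 1
    return counts


def _words(text: str) -> set[str]:
    """Lowercased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_judge(description: str, prompt: str) -> bool:
    """ASSUMPTION: toy stand-in for the LLM judge.

    Routes only when the prompt shares a word of 5+ letters with the description.
    """
    long_words = {w for w in _words(prompt) if len(w) >= 5}
    return bool(long_words & _words(description))
```

Swapping `keyword_judge` for a function that sends the description and prompt to the routing model keeps the rest of the loop unchanged, so every variant is scored against the same labeled prompt set.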

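| Variant              | Recall | Precision | F1    |
|----------------------|--------|-----------|-------|
| baseline (current)   | 0.800  | 1.000     | 0.889 |
| v1 explicit triggers | 0.950  | 1.000     | 0.974 |
| v2 intent-first      | 0.925  | 1.000     | 0.961 |
| v3 verb-dense        | 0.950  | 1.000     | 0.974 |
| v4 "Use when" style  | 0.975  | 1.000     | 0.987 |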

All variants had perfect precision (no false positives). The gap is entirely in recall.
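As a sanity check, the reported figures can be recomputed from raw counts. The sketch below assumes the true-positive counts implied by the reported recall values (recall × 40 should-trigger prompts) and zero false positives; the counts themselves are inferred, not taken from the raw benchmark logs, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


SHOULD_TRIGGER = 40  # from the issue: 40 should-trigger prompts

# ASSUMPTION: true-positive counts inferred as recall x 40; not raw benchmark data.
inferred_tp = {"baseline": 32, "v1": 38, "v2": 37, "v3": 38, "v4": 39}

for name, tp in inferred_tp.items():
    p, r, f1 = precision_recall_f1(tp, fp=0, fn=SHOULD_TRIGGER - tp)
    print(f"{name:9s} recall={r:.3f} precision={p:.3f} F1={f1:.3f}")
```

With precision fixed at 1.000, F1 reduces to 2R / (1 + R), so the F1 differences between variants come entirely from recall.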

Prompts the baseline misses

These 8 prompts have obvious playwright-cli solutions but don't contain the words "automate" or "browser interactions":

  • run tests/checkout.spec.ts in headed mode
  • debug why tests/auth.spec.ts line 42 is broken
  • record a demo video of the checkout flow
  • export the pricing page as a PDF
  • record a video with chapter titles for the onboarding demo
  • mock the /api/users endpoint to return an empty array
  • intercept network requests and log which ones are slow
  • (one additional borderline case)

Proposed fix

PR #409 changes only the description field in skills/playwright-cli/SKILL.md to the winning v4 style:

description: >
  Use when the user says: "go to this URL", "click this button", "fill this
  form", "take a screenshot", "scrape this page", "log in to X", "run my
  Playwright tests", "this test is failing", "write a test for", "mock this
  API", "record a demo". The skill opens a live browser and drives it.

This quotes the natural-language trigger phrases users actually type, giving the routing model a direct pattern-match signal rather than requiring it to infer that "record a demo" ≡ "automate browser interactions".
