skill: description causes ~20% recall gap in LLM routing (baseline recall=0.800)

## Problem

The playwright-cli skill description is the only signal an LLM reads before deciding whether to invoke the skill. The current description:

> *Automate browser interactions, test web pages and work with Playwright tests.*

causes the model to miss roughly **20% of valid invocations** when it cannot rely on prior knowledge of the skill name.

## Benchmark evidence

Tested 5 description variants against a 50-prompt set (40 should-trigger / 10 should-not-trigger) using **OpenAI Codex as the routing judge**. To eliminate the prior-knowledge confound, the skill was renamed to the fictional name `headless-pilot` — forcing Codex to route purely from description text, not from training data about playwright.

| Variant | Recall | Precision | F1 |
|---|---|---|---|
| **baseline (current)** | **0.800** | 1.000 | 0.889 |
| v1 explicit triggers | 0.950 | 1.000 | 0.974 |
| v2 intent-first | 0.925 | 1.000 | 0.961 |
| v3 verb-dense | 0.950 | 1.000 | 0.974 |
| **v4 "Use when" style** | **0.975** | **1.000** | **0.987** |

All variants had **perfect precision** (no false positives). The gap is entirely in recall.

## Prompts the baseline misses

These 8 prompts have obvious playwright-cli solutions but don't contain the words "automate" or "browser interactions":

- `run tests/checkout.spec.ts in headed mode`
- `debug why tests/auth.spec.ts line 42 is broken`
- `record a demo video of the checkout flow`
- `export the pricing page as a PDF`
- `record a video with chapter titles for the onboarding demo`
- `mock the /api/users endpoint to return an empty array`
- `intercept network requests and log which ones are slow`
- *(one additional borderline case)*

## Proposed fix

PR #409 changes only the `description` field in `skills/playwright-cli/SKILL.md` to the winning v4 style:

```yaml
description: >
  Use when the user says: "go to this URL", "click this button", "fill this
  form", "take a screenshot", "scrape this page", "log in to X", "run my
  Playwright tests", "this test is failing", "write a test for", "mock this
  API", "record a demo". The skill opens a live browser and drives it.
```

This quotes the natural-language trigger phrases users actually type, giving the routing model a direct pattern-match signal rather than requiring it to infer that "record a demo" ≡ "automate browser interactions".

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

skill: description causes ~20% recall gap in LLM routing (baseline recall=0.800) #410

Problem

Benchmark evidence

Prompts the baseline misses

Proposed fix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Variant	Recall	Precision	F1
baseline (current)	0.800	1.000	0.889
v1 explicit triggers	0.950	1.000	0.974
v2 intent-first	0.925	1.000	0.961
v3 verb-dense	0.950	1.000	0.974
v4 "Use when" style	0.975	1.000	0.987

skill: description causes ~20% recall gap in LLM routing (baseline recall=0.800) #410

Description

Problem

Benchmark evidence

Prompts the baseline misses

Proposed fix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions