chore(llm): Better defense against prompt injection in triage skill#19410
chore(llm): Better defense against prompt injection in triage skill#19410
Conversation
Codecov Results 📊Generated by Codecov Action |
size-limit report 📦
|
node-overhead report 🧳Note: This is a synthetic benchmark with a minimal express app and does not necessarily reflect the real-world performance impact in an application.
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| (r"\b(ignore|disregard|forget)\s+(all\s+)?(previous|prior|above)\s+(instructions?|prompts?|rules?)", 8, "Instruction override"), | ||
|
|
||
| # Prompt extraction (8 points) | ||
| (r"\b(show|reveal|display|output|print)\s+(your\s+)?(system\s+)?(prompt|instructions?)", 8, "Prompt extraction attempt"), |
There was a problem hiding this comment.
Prompt extraction regex matches common English phrases
High Severity
The "Prompt extraction attempt" pattern \b(show|reveal|display|output|print)\s+(your\s+)?(system\s+)?(prompt|instructions?) scores 8 points — exactly the rejection threshold — but both (your\s+)? and (system\s+)? are optional. This means perfectly innocuous phrases like "print instructions", "show instructions", or "display instructions" alone trigger full rejection. The project's own docs/changelog/v8.md even uses "we print instructions on how to prepare for the next major", so an issue quoting or paraphrasing that text would be falsely rejected. The intent was to catch "show your system prompt" or "reveal your instructions" but the double-optional qualifiers make the pattern far too broad for its high confidence score.


Adds
Closes #19411 (added automatically)