Summary
When running the GAIA benchmark with agent_type=acp-claude, 3 specific instances consistently fail with UsagePolicyRefusal errors. The failures are deterministic: the same 3 instances failed in both verified runs.
Affected Instances
| Instance ID | Task Type | Analysis |
|---|---|---|
| 2a649bb1-795f-4a01-b3be-9a01868dae73 | Scientific research query | False positive - asking about virus testing chemicals (SPFMV/SPCSV) |
| 983bba7c-c092-455f-b6c9-7857003d48fc | Academic cross-reference | False positive - asking about Hafnia alvei papers |
| 2d83110e-a098-4ebb-9987-066c06fa42d0 | Reversed text puzzle | Likely triggered by obfuscation detection |
Impact
- Failure rate: 3/500 = 0.6% of GAIA validation set
- Benchmarks affected: GAIA only (Commit0 has 0 failures)
- Models affected: claude-opus-4-6 via ACP
Error Details
UsagePolicyRefusal: Internal error: API Error: Claude Code is unable to respond
to this request, which appears to violate our Usage Policy
(https://www.anthropic.com/legal/aup). Try rephrasing the request or attempting
a different approach. If you are seeing this refusal repeatedly, try running
/model claude-sonnet-4-20250514 to switch models.
Characteristics:
- Immediate refusal (~2 seconds)
- Zero tokens consumed
- Not retriable (3-4 retries all fail)
- Pre-flight rejection before model processes input
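Because the refusal arrives immediately, consumes no tokens, and persisted across every retry, a harness could treat it as non-retriable rather than spending its retry budget. A minimal sketch, assuming the refusal can be recognized from the message text quoted above; `is_usage_policy_refusal`, `run_with_retries`, and the `run_once` callable are hypothetical names for illustration, not part of the benchmark harness:

```python
from typing import Callable, Optional

# Substring taken from the error text quoted above. Matching on it is an
# assumption about how the refusal surfaces, not a documented contract.
REFUSAL_MARKER = "appears to violate our Usage Policy"


def is_usage_policy_refusal(error: Exception) -> bool:
    """Return True when the error message looks like the refusal shown above."""
    return REFUSAL_MARKER in str(error)


def run_with_retries(run_once: Callable[[], str], max_retries: int = 3) -> str:
    """Call run_once up to max_retries + 1 times, stopping at once on a refusal."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries + 1):
        try:
            return run_once()
        except Exception as error:
            if is_usage_policy_refusal(error):
                # Retries did not help in the runs reported here, so give up early.
                raise
            last_error = error
    if last_error is None:
        raise ValueError("max_retries must be >= 0")
    raise last_error
```

Other errors still go through the normal retry loop; only the refusal short-circuits it.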
Task Details
Instance 1: 2a649bb1-... (Virology Research)
What are the EC numbers of the two most commonly used chemicals for the virus
testing method in the paper about SPFMV and SPCSV in the Pearl Of Africa from
2016? Return the semicolon-separated numbers in the order of the alphabetized
chemicals.
- SPFMV = Sweet Potato Feathery Mottle Virus
- SPCSV = Sweet Potato Chlorotic Stunt Virus
- "Pearl of Africa" = Uganda
- EC numbers = Enzyme Commission classification (scientific identifiers)
Instance 2: 983bba7c-... (Microbiology Cross-Reference)
What animals that were mentioned in both Ilias Lagkouvardos's and Olga Tapia's
papers on the alvei species of the genus named for Copenhagen outside the
bibliographies were also present in the 2021 article cited on the alvei species'
Wikipedia page about a multicenter, randomized, double-blind study?
- "Genus named for Copenhagen" = Hafnia (Latin for Copenhagen)
- Hafnia alvei = bacterium studied in gut microbiome research
Instance 3: 2d83110e-... (Reversed Text Puzzle)
.rewsna eht sa "tfel" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI
Decoded: "If you understand this sentence, write the opposite of the word 'left' as the answer."
Expected answer: "right"
Runs Verified
| Run ID | eval_limit | UsagePolicyRefusal Count |
|---|---|---|
| 22935238175 | 500 | 3 (same instances) |
| 22928852361 | 500 | 3 (same instances) |
Proposed Solutions
Short-term Workarounds
- Instance exclusion list: Add these 3 instance IDs to a known-failing list for ACP Claude scoring adjustments
- Model switching fallback: Implement auto-fallback to claude-sonnet-4-20250514 when UsagePolicyRefusal is detected:
    if isinstance(error, UsagePolicyRefusalError):
        # Retry with alternative model
        response = prompt_with_model("claude-sonnet-4-20250514", task)
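The exclusion list could be applied at scoring time. A rough sketch, assuming per-instance results are available as a mapping from instance ID to pass/fail; `KNOWN_REFUSALS`, `adjusted_accuracy`, and that result format are illustrative assumptions, not the harness's actual API:

```python
from typing import Dict, Tuple

# The three instance IDs listed in this report.
KNOWN_REFUSALS = frozenset({
    "2a649bb1-795f-4a01-b3be-9a01868dae73",
    "983bba7c-c092-455f-b6c9-7857003d48fc",
    "2d83110e-a098-4ebb-9987-066c06fa42d0",
})


def adjusted_accuracy(results: Dict[str, bool]) -> Tuple[float, float]:
    """Return (raw accuracy, accuracy with the known refusals excluded)."""
    if not results:
        raise ValueError("results is empty")
    raw = sum(results.values()) / len(results)
    # Drop only the instances on the known-failing list before rescoring.
    kept = {iid: ok for iid, ok in results.items() if iid not in KNOWN_REFUSALS}
    adjusted = sum(kept.values()) / len(kept) if kept else 0.0
    return raw, adjusted
```

Reporting both numbers side by side keeps the adjustment visible instead of silently changing the headline score.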
Long-term
- Report to Claude Code team: Instances 1 & 2 are clear false positives on legitimate scientific research queries
- Prompt preprocessing: For reversed text instances, consider decoding before sending to Claude Code
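The preprocessing idea for reversed text could look roughly like the following. This is only a heuristic sketch: it reverses the prompt when the reversed form contains more common English words than the original. `COMMON_WORDS` and `maybe_unreverse` are hypothetical names, and the tiny word list is an assumption that a real preprocessor would need to extend:

```python
import re

# A very small stop-word list used only for this heuristic.
COMMON_WORDS = frozenset(
    {"the", "of", "as", "if", "you", "this", "a", "to", "and", "is", "in"}
)


def common_word_count(text: str) -> int:
    """Count tokens in text that appear in COMMON_WORDS (case-insensitive)."""
    return sum(1 for word in re.findall(r"[a-z]+", text.lower()) if word in COMMON_WORDS)


def maybe_unreverse(prompt: str) -> str:
    """Return the character-reversed prompt if it looks more like English."""
    reversed_prompt = prompt[::-1]
    if common_word_count(reversed_prompt) > common_word_count(prompt):
        return reversed_prompt
    return prompt
```

Decoding changes the task that is actually sent to the model, so any run that uses it should be flagged as preprocessed when results are reported.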
Labels
bug, acp, gaia