eval skill: parameterize external judge/user-sim endpoints via .env #1591
cjluo-nv wants to merge 1 commit into
Judge / user-simulator / scoring endpoints (HLE, AA-LCR, Tau2 today, and any
future such task) need a model_id + URL that vary per user/site. Make them
reusable without committing internal infra:
- recipes/env.example: add NS_JUDGE_URL / HLE_JUDGE_MODEL_ID / LCR_JUDGE_MODEL_ID
/ TAU2_USER_MODEL_ID / TAU2_JUDGER_MODEL_ID / TAU2_ENDPOINT_URL placeholders
with recommended models named (GPT-4o / Qwen3 235B / gpt-oss-120B) and GENERIC
hosts only (no internal hostnames or gateway model-routing strings).
- recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md: carry <VAR> literal
placeholders the skill substitutes from the user's .env. These are config, not
secrets, so they are NOT exported (avoids the set -a / ${oc.env:...} footgun); only
api_key (INFERENCE_API_KEY) stays an exported env var read by the harness.
- SKILL.md: instruct literal substitution from .env, framed as a general pattern
for any external-endpoint task (not a fixed 3-task list).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Walkthrough
This PR updates documentation and configuration examples for external judge and user-simulator endpoint setup across evaluation tasks. It establishes a consistent pattern for substituting literal
Changes
Judge Endpoint Configuration
meenchen left a comment
Bot review — DM the bot to share feedback.
Pure-documentation update to the evaluation skill that parameterizes external judge / user-simulator endpoints via .env placeholders. Small (51/-19), internally consistent: the env-var names defined in recipes/env.example (HLE_JUDGE_MODEL_ID, LCR_JUDGE_MODEL_ID, NS_JUDGE_URL, TAU2_USER_MODEL_ID, TAU2_JUDGER_MODEL_ID, TAU2_ENDPOINT_URL) match the <VAR> placeholders in hle.md, lcr.md, tau2_bench_telecom.md, and SKILL.md Step 5's general guidance. The /v1 vs full /v1/chat/completions distinction is correctly captured (nemo-skills uses base, tau2 needs full path). Literal-substitution-not-${oc.env:...} rationale is uniform across all five files. No tests needed (docs only); no licensing changes; rebase against #1583's lcr.md edits looks clean.
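The base-URL vs. full-path distinction mentioned in this review can be illustrated with a small sketch. It assumes an OpenAI-compatible layout in which the full endpoint is the `/v1` base plus `/chat/completions`; the helper names and the `inference.example.com` host are hypothetical and do not come from the PR.

```python
CHAT_SUFFIX = "/chat/completions"


def to_full_chat_url(url: str) -> str:
    """Return the full chat-completions URL for a /v1-style base URL.

    A URL that already ends in /chat/completions is returned unchanged
    (apart from a trailing slash being dropped).
    """
    trimmed = url.rstrip("/")
    if trimmed.endswith(CHAT_SUFFIX):
        return trimmed
    return trimmed + CHAT_SUFFIX


def to_base_url(url: str) -> str:
    """Return the /v1-style base URL, stripping /chat/completions if present."""
    trimmed = url.rstrip("/")
    if trimmed.endswith(CHAT_SUFFIX):
        return trimmed[: -len(CHAT_SUFFIX)]
    return trimmed


# Hypothetical host, used only for illustration.
base = "https://inference.example.com/v1"
full = to_full_chat_url(base)
print(full)
print(to_base_url(full))
```

Under this assumption, the value used for NS_JUDGE_URL would take the base form and the value used for TAU2_ENDPOINT_URL the full form.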
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1591 +/- ##
=======================================
Coverage 73.23% 73.23%
=======================================
Files 479 479
Lines 52435 52435
=======================================
Hits 38401 38401
Misses 14034 14034
# JUDGE_API_KEY=
# INFERENCE_API_KEY=

# --- Optional: judge / user-simulator endpoints (model_id + URL) ---
Is it possible to set this globally across agent sessions, so we don't have to set it up again when starting a new one (e.g., in a new workspace)?
What does this PR do?
Type of change: documentation

Several AA tasks call an external judge / user-simulator / scoring endpoint whose model_id + url vary per user/site (HLE, AA-LCR, Tau2 today; the guidance is written as a general pattern for any future such benchmark). Previously each recipe hardcoded <...> placeholders that a user had to hand-edit in every config. This makes them reusable without committing any internal infrastructure:

- recipes/env.example: add placeholders (NS_JUDGE_URL, HLE_JUDGE_MODEL_ID, LCR_JUDGE_MODEL_ID, TAU2_USER_MODEL_ID, TAU2_JUDGER_MODEL_ID, TAU2_ENDPOINT_URL) with the recommended model named (GPT-4o / Qwen3 235B / gpt-oss-120B) but only generic hosts (https://<your-inference-host>/v1). No internal hostnames or gateway model-routing strings are committed; real values live in the user's gitignored .env.
- recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md: carry <VAR> literal placeholders (named after the .env keys) that the skill substitutes as literal values from the user's .env. These are config, not secrets, so they are not exported, which avoids the ${oc.env:...} footgun (it silently fails unless the var was exported with set -a). Only api_key (INFERENCE_API_KEY) stays an exported env var read by the harness.
- SKILL.md (Step 5): instructs literal substitution from .env, framed as a general pattern for any external-endpoint task.

The /v1 base (nemo-skills) vs full /v1/chat/completions (tau2-bench) URL distinction is documented.

Usage
N/A: documentation / skill-template only.

Testing
pre-commit run passes (markdownlint). Verified no internal hostnames / gateway model IDs are present in any committed file.

Before your PR is "Ready for review"
CONTRIBUTING.md: N/A

Additional Information
Branched off latest main (includes #1583). Touches lcr.md, which #1583 also edited (parallelism field); the edits are on different lines and rebased cleanly.

🤖 Generated with Claude Code
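The literal-substitution pattern described in this PR can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the skill's actual implementation: the helper names (`parse_env`, `substitute`), the recipe field names, the host, and the model ID value are all hypothetical. The point it shows is that values are read from the .env text and written into the config as literals, so nothing has to be exported into the process environment.

```python
import re


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from .env-style text, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Matches upper-case placeholders such as <NS_JUDGE_URL>; lower-case hints
# like <your-inference-host> are deliberately not matched.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def substitute(template: str, env: dict[str, str]) -> str:
    """Replace each <VAR> with its literal value; leave unknown placeholders as-is."""
    return PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), template)


# Hypothetical .env content and recipe fragment, for illustration only.
env_text = """
# judge endpoint (hypothetical values)
NS_JUDGE_URL=https://inference.example.com/v1
HLE_JUDGE_MODEL_ID=example-judge-model
"""
recipe = (
    "judge:\n"
    "  model_id: <HLE_JUDGE_MODEL_ID>\n"
    "  url: <NS_JUDGE_URL>\n"
    "  other_model_id: <LCR_JUDGE_MODEL_ID>\n"
)

rendered = substitute(recipe, parse_env(env_text))
print(rendered)
```

Leaving unknown placeholders untouched, rather than replacing them with an empty string, is a design choice in this sketch: a missing .env key stays visible in the rendered config instead of silently producing an empty field.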
Summary by CodeRabbit