eval skill: parameterize external judge/user-sim endpoints via .env#1591

Open
cjluo-nv wants to merge 1 commit into main from skill/eval-judge-endpoints-env

Conversation

@cjluo-nv cjluo-nv commented Jun 1, 2026

What does this PR do?

Type of change: documentation

Several AA tasks call an external judge, user-simulator, or scoring endpoint whose model_id and URL vary per user and site. Today these are HLE, AA-LCR, and Tau2, and the guidance is written as a general pattern for any future benchmark of this kind. Previously, each recipe hardcoded <...> placeholders that users had to hand-edit in every config. This PR makes the recipes reusable without committing any internal infrastructure:

  • recipes/env.example: add placeholders — NS_JUDGE_URL, HLE_JUDGE_MODEL_ID, LCR_JUDGE_MODEL_ID, TAU2_USER_MODEL_ID, TAU2_JUDGER_MODEL_ID, TAU2_ENDPOINT_URL — with the recommended model named (GPT-4o / Qwen3 235B / gpt-oss-120B) but only generic hosts (https://<your-inference-host>/v1). No internal hostnames or gateway model-routing strings are committed; real values live in the user's gitignored .env.
  • recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md: carry <VAR> literal placeholders (named after the .env keys) that the skill substitutes as literal values from the user's .env. These are config, not secrets, so they are not exported — which avoids the ${oc.env:...} footgun (it silently fails unless the var was exported with set -a). Only api_key (INFERENCE_API_KEY) stays an exported env var read by the harness.
  • SKILL.md (Step 5): instructs literal substitution from .env, framed as a general pattern for any external-endpoint task.
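
The literal-substitution step described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names `parse_dotenv` and `substitute_placeholders`, the example host, the model ID value, and the recipe snippet layout are all assumptions made for the example. Only the `<VAR>` convention and the key names come from this PR.

```python
import re


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Matches <UPPER_SNAKE_CASE> only, so a generic host such as
# <your-inference-host> is left untouched.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def substitute_placeholders(template: str, env: dict[str, str]) -> str:
    """Replace each <VAR> with its literal .env value; fail loudly if a key is missing."""

    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"{name} is not set in .env")
        return env[name]

    return PLACEHOLDER.sub(repl, template)


# Hypothetical .env content and recipe snippet, for illustration only.
env = parse_dotenv(
    "# gitignored .env\n"
    "NS_JUDGE_URL=https://judge.example.com/v1\n"
    "HLE_JUDGE_MODEL_ID=example-judge-model\n"
)
recipe = (
    "judge:\n"
    "  model_id: <HLE_JUDGE_MODEL_ID>\n"
    "  url: <NS_JUDGE_URL>\n"
    "  api_key: INFERENCE_API_KEY\n"
)
print(substitute_placeholders(recipe, env))
```

Because the values are written into the config as literals, nothing depends on the shell having exported them. The api_key field keeps the name of the exported variable rather than its value.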

The /v1-base (nemo-skills) vs full /v1/chat/completions (tau2-bench) URL distinction is documented.
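
To make the two URL shapes concrete, here is a small sketch that converts between them. The helper names and the example host are hypothetical. The only facts taken from this PR are that nemo-skills expects the /v1 base and tau2-bench expects the full /v1/chat/completions path.

```python
CHAT_SUFFIX = "/chat/completions"


def chat_completions_url(base_url: str) -> str:
    """Turn a /v1 base URL into the full /v1/chat/completions URL."""
    base = base_url.rstrip("/")
    if base.endswith(CHAT_SUFFIX):
        return base  # already the full path
    return base + CHAT_SUFFIX


def v1_base_url(url: str) -> str:
    """Strip a trailing /chat/completions to recover the /v1 base URL."""
    trimmed = url.rstrip("/")
    if trimmed.endswith(CHAT_SUFFIX):
        return trimmed[: -len(CHAT_SUFFIX)]
    return trimmed


# Hypothetical host, for illustration only.
ns_judge_url = "https://judge.example.com/v1"            # shape for NS_JUDGE_URL
tau2_endpoint_url = chat_completions_url(ns_judge_url)   # shape for TAU2_ENDPOINT_URL
print(tau2_endpoint_url)
```

A helper like this is only a convenience for checking values by hand. The recipes themselves take each URL as a separate literal from .env.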

Usage

N/A — documentation / skill-template only.

Testing

pre-commit run passes (markdownlint). Verified no internal hostnames / gateway model IDs are present in any committed file.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (documentation)
  • Did you update Changelog?: N/A (skill docs)
  • Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

Branched off latest main (includes #1583). Touches lcr.md, which #1583 also edited (parallelism field) — different lines, rebased cleanly.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guidance for configuring external judge and user-simulator endpoints in evaluation tasks
    • Clarified best practices for substituting configuration values from environment configuration files while keeping API keys as environment variables
    • Updated task documentation for HLE, LCR, and Tau2-Bench with improved instructions for credential and endpoint handling

Judge / user-simulator / scoring endpoints (HLE, AA-LCR, Tau2 today, and any
future such task) need a model_id + URL that vary per user/site. Make them
reusable without committing internal infra:

- recipes/env.example: add NS_JUDGE_URL / HLE_JUDGE_MODEL_ID / LCR_JUDGE_MODEL_ID
  / TAU2_USER_MODEL_ID / TAU2_JUDGER_MODEL_ID / TAU2_ENDPOINT_URL placeholders
  with recommended models named (GPT-4o / Qwen3 235B / gpt-oss-120B) and GENERIC
  hosts only (no internal hostnames or gateway model-routing strings).
- recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md: carry <VAR> literal
  placeholders the skill substitutes from the user's .env. These are config, not
  secrets, so they are NOT exported (avoids the set-a/${oc.env:} footgun); only
  api_key (INFERENCE_API_KEY) stays an exported env var read by the harness.
- SKILL.md: instruct literal substitution from .env, framed as a general pattern
  for any external-endpoint task (not a fixed 3-task list).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai Bot commented Jun 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 25b7c094-fa16-451a-a73e-0e59a85c42f6

📥 Commits

Reviewing files that changed from the base of the PR and between 38c7843 and 0e9a6fc.

📒 Files selected for processing (5)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/tasks/aa/hle.md
  • .claude/skills/evaluation/recipes/tasks/aa/lcr.md
  • .claude/skills/evaluation/recipes/tasks/aa/tau2_bench_telecom.md

Walkthrough

This PR updates documentation and configuration examples for external judge and user-simulator endpoint setup across evaluation tasks. It establishes a consistent pattern for substituting literal .env values for model IDs and URLs while keeping only the API key as an exported environment variable.
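
The reason only exported variables reach the harness can be illustrated with a short sketch. It does not exercise OmegaConf itself. It only shows the underlying shell behavior that makes an environment lookup such as ${oc.env:...} come up empty: a variable assigned without export (or without set -a) is not passed to child processes. The variable name DEMO_JUDGE_URL and the host are hypothetical, and the sketch assumes a POSIX sh is available.

```python
import subprocess
import sys


def visible_in_child(shell_prefix: str) -> str:
    """Run shell_prefix in sh, then report what a child Python process sees."""
    script = (
        f"{shell_prefix}\n"
        f"\"{sys.executable}\" -c "
        "\"import os; print(os.environ.get('DEMO_JUDGE_URL'))\""
    )
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Plain assignment: the value stays local to the shell.
plain = visible_in_child("DEMO_JUDGE_URL=https://judge.example.com/v1")

# With allexport (set -a), later assignments are exported to child processes.
exported = visible_in_child("set -a\nDEMO_JUDGE_URL=https://judge.example.com/v1")

print("without export:", plain)
print("with set -a:   ", exported)
```

Substituting model IDs and URLs as literals sidesteps this entirely, which leaves the API key as the only value that needs to be in the process environment.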

Changes

Judge Endpoint Configuration

| Layer / File(s) | Summary |
| --- | --- |
| Core guidance and environment template: .claude/skills/evaluation/SKILL.md, .claude/skills/evaluation/recipes/env.example | Adds general guidance on configuring external judge/user-simulator endpoints and defines placeholder environment variables for HLE, AA-LCR, and Tau2-Bench tasks. |
| HLE task recipe updates: .claude/skills/evaluation/recipes/tasks/aa/hle.md | Clarifies HLE judge configuration to use literal .env values (HLE_JUDGE_MODEL_ID, NS_JUDGE_URL) with only INFERENCE_API_KEY exported. |
| LCR task recipe updates: .claude/skills/evaluation/recipes/tasks/aa/lcr.md | Specifies LCR judge configuration using .env placeholders (LCR_JUDGE_MODEL_ID, NS_JUDGE_URL) with Qwen3 235B recommended, keeping only INFERENCE_API_KEY exported. |
| Tau2-Bench task recipe updates: .claude/skills/evaluation/recipes/tasks/aa/tau2_bench_telecom.md | Updates Tau2-Bench configuration to use task-specific .env placeholders (TAU2_USER_MODEL_ID, TAU2_JUDGER_MODEL_ID, TAU2_ENDPOINT_URL with full /v1/chat/completions path) and INFERENCE_API_KEY for exports. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

  • kaix-nv
  • meenchen
  • chadvoegele
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'eval skill: parameterize external judge/user-sim endpoints via .env' accurately summarizes the main change: parameterizing external judge and user-simulator endpoints through environment variables. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR contains only Markdown documentation and configuration examples, with no Python code, pyproject.toml, or requirements.txt changes. Security check is not applicable. |

@meenchen meenchen left a comment

Bot review — DM the bot to share feedback.

Pure-documentation update to the evaluation skill that parameterizes external judge / user-simulator endpoints via .env placeholders. Small (51/-19), internally consistent: the env-var names defined in recipes/env.example (HLE_JUDGE_MODEL_ID, LCR_JUDGE_MODEL_ID, NS_JUDGE_URL, TAU2_USER_MODEL_ID, TAU2_JUDGER_MODEL_ID, TAU2_ENDPOINT_URL) match the <VAR> placeholders in hle.md, lcr.md, tau2_bench_telecom.md, and SKILL.md Step 5's general guidance. The /v1 vs full /v1/chat/completions distinction is correctly captured (nemo-skills uses base, tau2 needs full path). Literal-substitution-not-${oc.env:...} rationale is uniform across all five files. No tests needed (docs only); no licensing changes; rebase against #1583's lcr.md edits looks clean.

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.23%. Comparing base (f0d2237) to head (0e9a6fc).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1591   +/-   ##
=======================================
  Coverage   73.23%   73.23%           
=======================================
  Files         479      479           
  Lines       52435    52435           
=======================================
  Hits        38401    38401           
  Misses      14034    14034           
Flag Coverage Δ
unit 53.62% <ø> (ø)

# JUDGE_API_KEY=
# INFERENCE_API_KEY=

# --- Optional: judge / user-simulator endpoints (model_id + URL) ---

@meenchen meenchen Jun 1, 2026

Is it possible to set this globally across agent sessions, so that we don't have to set it up again when starting a new one (e.g., in a new workspace)?
