eval skill: parameterize external judge/user-sim endpoints via .env by cjluo-nv · Pull Request #1591 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-06-01T21:36:15Z

What does this PR do?

Type of change: documentation

Several AA tasks call an external judge / user-simulator / scoring endpoint whose model_id + url vary per user/site. Today these are HLE, AA-LCR, and Tau2, and the guidance is written as a general pattern for any future benchmark of this kind. Previously each recipe hardcoded <...> placeholders that a user had to hand-edit in every config. This PR makes the recipes reusable without committing any internal infrastructure:

- recipes/env.example: add placeholders (NS_JUDGE_URL, HLE_JUDGE_MODEL_ID, LCR_JUDGE_MODEL_ID, TAU2_USER_MODEL_ID, TAU2_JUDGER_MODEL_ID, TAU2_ENDPOINT_URL) with the recommended model named (GPT-4o / Qwen3 235B / gpt-oss-120B) but only generic hosts (https://<your-inference-host>/v1). No internal hostnames or gateway model-routing strings are committed; real values live in the user's gitignored .env.
- recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md: carry <VAR> literal placeholders (named after the .env keys) that the skill substitutes as literal values from the user's .env. These are config, not secrets, so they are not exported, which avoids the ${oc.env:...} footgun (it silently fails unless the var was exported with set -a). Only api_key (INFERENCE_API_KEY) stays an exported env var read by the harness.
- SKILL.md (Step 5): instructs literal substitution from .env, framed as a general pattern for any external-endpoint task.
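
To make the literal-substitution idea concrete, here is a minimal sketch of what "replace each <VAR> with the value from .env" could look like. It is not the skill's implementation: `parse_env`, `substitute`, the example host, the model name, and the three-line recipe fragment are all hypothetical and exist only for illustration. The values are read as plain text and never exported, so nothing depends on `set -a` or `${oc.env:...}` resolution.

```python
import re

# Matches <VAR> placeholders named like .env keys, e.g. <HLE_JUDGE_MODEL_ID>.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def substitute(template: str, env: dict) -> str:
    """Replace every <VAR> with its literal value; raise if a key is missing."""
    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"{name} is not set in .env")
        return env[name]

    return PLACEHOLDER.sub(replace, template)


# Hypothetical .env contents (illustrative values, not real endpoints).
ENV_TEXT = """\
# judge endpoint: plain config, not exported
NS_JUDGE_URL=https://inference.example.com/v1
HLE_JUDGE_MODEL_ID=example-judge-model
"""

# Hypothetical recipe fragment; api_key only names the exported variable.
TEMPLATE = """\
model_id: <HLE_JUDGE_MODEL_ID>
url: <NS_JUDGE_URL>
api_key: INFERENCE_API_KEY
"""

if __name__ == "__main__":
    print(substitute(TEMPLATE, parse_env(ENV_TEXT)))
```

Raising on a missing key is a deliberate choice in this sketch: an unfilled placeholder fails immediately instead of reaching the harness as a literal `<VAR>` string.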

The /v1-base (nemo-skills) vs full /v1/chat/completions (tau2-bench) URL distinction is documented.
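
The two URL forms are easy to mix up, so a small normalizer can help when checking a .env value. The sketch below assumes only what is stated above (nemo-skills takes the `/v1` base, tau2-bench takes the full `/v1/chat/completions` path); the helper names and the example host are hypothetical.

```python
CHAT_SUFFIX = "/chat/completions"


def to_base_url(url: str) -> str:
    """Return the base form (ending in /v1) expected for NS_JUDGE_URL."""
    url = url.rstrip("/")
    if url.endswith(CHAT_SUFFIX):
        url = url[: -len(CHAT_SUFFIX)]
    return url


def to_chat_completions_url(url: str) -> str:
    """Return the full-path form expected for TAU2_ENDPOINT_URL."""
    return to_base_url(url) + CHAT_SUFFIX


if __name__ == "__main__":
    base = "https://inference.example.com/v1"  # hypothetical host
    print(to_base_url(base + "/chat/completions"))
    print(to_chat_completions_url(base))
```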

Usage

N/A — documentation / skill-template only.

Testing

pre-commit run passes (markdownlint). Verified no internal hostnames / gateway model IDs are present in any committed file.

Before your PR is "Ready for review"

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: N/A (documentation)
Did you update Changelog?: N/A (skill docs)
Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

Branched off latest main (includes #1583). Touches lcr.md, which #1583 also edited (parallelism field) — different lines, rebased cleanly.

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Added comprehensive guidance for configuring external judge and user-simulator endpoints in evaluation tasks
- Clarified best practices for substituting configuration values from environment configuration files while keeping API keys as environment variables
- Updated task documentation for HLE, LCR, and Tau2-Bench with improved instructions for credential and endpoint handling

Judge / user-simulator / scoring endpoints (HLE, AA-LCR, Tau2 today, and any future such task) need a model_id + URL that vary per user/site. Make them reusable without committing internal infra:

- recipes/env.example: add NS_JUDGE_URL / HLE_JUDGE_MODEL_ID / LCR_JUDGE_MODEL_ID / TAU2_USER_MODEL_ID / TAU2_JUDGER_MODEL_ID / TAU2_ENDPOINT_URL placeholders with recommended models named (GPT-4o / Qwen3 235B / gpt-oss-120B) and GENERIC hosts only (no internal hostnames or gateway model-routing strings).
- recipes/tasks/aa/{hle,lcr,tau2_bench_telecom}.md: carry <VAR> literal placeholders the skill substitutes from the user's .env. These are config, not secrets, so they are NOT exported (avoids the set-a/${oc.env:} footgun); only api_key (INFERENCE_API_KEY) stays an exported env var read by the harness.
- SKILL.md: instruct literal substitution from .env, framed as a general pattern for any external-endpoint task (not a fixed 3-task list).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai · 2026-06-01T21:36:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 25b7c094-fa16-451a-a73e-0e59a85c42f6

📥 Commits

Reviewing files that changed from the base of the PR and between 38c7843 and 0e9a6fc.

📒 Files selected for processing (5)

.claude/skills/evaluation/SKILL.md
.claude/skills/evaluation/recipes/env.example
.claude/skills/evaluation/recipes/tasks/aa/hle.md
.claude/skills/evaluation/recipes/tasks/aa/lcr.md
.claude/skills/evaluation/recipes/tasks/aa/tau2_bench_telecom.md

📝 Walkthrough


This PR updates documentation and configuration examples for external judge and user-simulator endpoint setup across evaluation tasks. It establishes a consistent pattern for substituting literal .env values for model IDs and URLs while keeping only the API key as an exported environment variable.

Changes

Judge Endpoint Configuration

| Layer / File(s) | Summary |
| --- | --- |
| Core guidance and environment template `.claude/skills/evaluation/SKILL.md`, `.claude/skills/evaluation/recipes/env.example` | Adds general guidance on configuring external judge/user-simulator endpoints and defines placeholder environment variables for HLE, AA-LCR, and Tau2-Bench tasks. |
| HLE task recipe updates `.claude/skills/evaluation/recipes/tasks/aa/hle.md` | Clarifies HLE judge configuration to use literal `.env` values (`HLE_JUDGE_MODEL_ID`, `NS_JUDGE_URL`) with only `INFERENCE_API_KEY` exported. |
| LCR task recipe updates `.claude/skills/evaluation/recipes/tasks/aa/lcr.md` | Specifies LCR judge configuration using `.env` placeholders (`LCR_JUDGE_MODEL_ID`, `NS_JUDGE_URL`) with Qwen3 235B recommended, keeping only `INFERENCE_API_KEY` exported. |
| Tau2-Bench task recipe updates `.claude/skills/evaluation/recipes/tasks/aa/tau2_bench_telecom.md` | Updates Tau2-Bench configuration to use task-specific `.env` placeholders (`TAU2_USER_MODEL_ID`, `TAU2_JUDGER_MODEL_ID`, `TAU2_ENDPOINT_URL` with full `/v1/chat/completions` path) and `INFERENCE_API_KEY` for exports. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

kaix-nv
meenchen
chadvoegele

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'eval skill: parameterize external judge/user-sim endpoints via .env' accurately summarizes the main change: parameterizing external judge and user-simulator endpoints through environment variables. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR contains only Markdown documentation and configuration examples, with no Python code, pyproject.toml, or requirements.txt changes. Security check is not applicable. |


meenchen

Bot review — DM the bot to share feedback.

Pure-documentation update to the evaluation skill that parameterizes external judge / user-simulator endpoints via .env placeholders. Small (51/-19), internally consistent: the env-var names defined in recipes/env.example (HLE_JUDGE_MODEL_ID, LCR_JUDGE_MODEL_ID, NS_JUDGE_URL, TAU2_USER_MODEL_ID, TAU2_JUDGER_MODEL_ID, TAU2_ENDPOINT_URL) match the <VAR> placeholders in hle.md, lcr.md, tau2_bench_telecom.md, and SKILL.md Step 5's general guidance. The /v1 vs full /v1/chat/completions distinction is correctly captured (nemo-skills uses base, tau2 needs full path). Literal-substitution-not-${oc.env:...} rationale is uniform across all five files. No tests needed (docs only); no licensing changes; rebase against #1583's lcr.md edits looks clean.

codecov · 2026-06-01T21:49:57Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.23%. Comparing base (f0d2237) to head (0e9a6fc).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1591   +/-   ##
=======================================
  Coverage   73.23%   73.23%           
=======================================
  Files         479      479           
  Lines       52435    52435           
=======================================
  Hits        38401    38401           
  Misses      14034    14034

Flag	Coverage Δ
unit	`53.62% <ø> (ø)`


meenchen · 2026-06-01T22:27:54Z

 # JUDGE_API_KEY=
 # INFERENCE_API_KEY=

+# --- Optional: judge / user-simulator endpoints (model_id + URL) ---


Is it possible to set this globally across agent sessions, so we don't have to set it up again when starting a new one (e.g., in a new workspace)?

cjluo-nv requested review from chadvoegele, kaix-nv and meenchen June 1, 2026 21:37

coderabbitai Bot approved these changes Jun 1, 2026


meenchen approved these changes Jun 1, 2026


chadvoegele approved these changes Jun 1, 2026


meenchen reviewed Jun 1, 2026

cjluo-nv wants to merge 1 commit into main from skill/eval-judge-endpoints-env


3 participants
