Skip to content

Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets)#951

Open
MahanFathi wants to merge 112 commits intomainfrom
mfathi/turing_envs_v2
Open

Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets)#951
MahanFathi wants to merge 112 commits intomainfrom
mfathi/turing_envs_v2

Conversation

@MahanFathi
Copy link
Copy Markdown
Contributor

@MahanFathi MahanFathi commented Mar 24, 2026

Important

This PR is a successor to #801, branch feature/nvidia-IF-bench-validators-integrations. The main changes were already there, this branch includes changes we needed to successfully test the environments. Let's merge this PR instead of #801.

Important note to @bxyu-nvidia: rollout_collection.py was changed in the abovelinked PR, which requires your careful review.

Summary of my changes

  • Dynamic judge URL discovery — Added judge_server_name config field so the judge URL is resolved at runtime from the NeMo-Gym server registry instead of being hardcoded. Enables use of local_vllm_model (which actually spins up vLLM via Ray) as the judge server type.
  • Lazy import fix in profiling.py — Moved gprof2dot/pydot imports inside dump() to avoid ModuleNotFoundError on Ray workers where profiling deps aren't installed.
  • Configurable reward aggregation — Replaced hard-coded all-or-nothing (all) reward with a configurable aggregation_mode supporting all, any, mean, min, and max. Default remains all (no behavior change).
  • Thinking trace stripping — Added helpers to skip type="reasoning" output items and strip <think>/<thinking> tags before evaluation, preventing chain-of-thought from contaminating validator checks and judge prompts.
  • Judge prompt restructuring — Reordered LLM_JUDGE_QUESTION_PROMPT to present conversation context before the model response, and replaced fragile JSON output format with robust [[YES]]/[[NO]] bracket markers plus multi-tier fallback extraction. Eliminates silent false negatives from JSON parse failures.
  • Configurable judge sampling parameters — Exposed judge_temperature (default 0.7), judge_top_p (default 0.8), and judge_max_tokens (default 10000) as config fields, replacing previously hardcoded values.
  • Config & docs updates — Base turing_vif.yaml updated with new fields; README documented aggregation_mode.

All changes are backwards-compatible.

Checklist

  • Ran successful experiments with Multichallenge and InverseIFEval datasets on Nano-v3
  • @abukharin-nv is wrapping up his experiments using the same env on CFBench and SysBench
  • Gitlab issues for training datasets have been created (to my knowledge MC and IIFEval are approved by legal)

dhrutisundar-turing and others added 30 commits January 7, 2026 14:52
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Flagging validation issues and writing them into error.json

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
…pport

[IFTL-218] Multi-Lang Support

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
…r-turing/Nvidia-gym-turing into fixes/lang_validator

Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Fixes/lang validator

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Remove the "Where Do Reward Scores Come From?" note that implied custom
verification logic is optional. Also fix tutorial goals to match actual
content and correct the resource server name.

Fixes #776

Signed-off-by: Chris Wing <cwing@nvidia.com>
change tutorial card est time from 45-90 to 30 mins as in the tutorial
itself

#780

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>
…t tutorials section (#785)

Signed-off-by: Brian Yu <bxyu@nvidia.com>

Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
5927179

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
- Extract _preprocess_rows_from_config from duplicate run_from_config
- Add missing imports: json, deepcopy, Union, Literal
- Add return results to run_from_config
- Remove dead _post_coroutine block (undefined server_client reference)

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
…rtifacts

- Remove # pragma: no cover from RolloutCollectionHelper class
- Drop stale how_to_start.md entry from .gitignore
- Delete tracked example_rollouts.jsonl generated artifact

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
Auto-generated by update-readme-table pre-commit hook.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
gwarmstrong and others added 19 commits March 17, 2026 13:49
Add GPQA benchmark, per_sample_aggregate, and refined metrics

* Add benchmarks/gpqa/ (config, prepare.py, prompt, README)
* Extract common utilities for metrics
* Add grading mode override to mcqa environment

---------

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Adds YAML and XML parser logic for the structured outputs verifier.

Included addition of new dataset with JSON, YAML, and XML data.
config found in 
```
resources_servers/structured_outputs/configs/structured_outputs_json_yaml_xml_v1.yaml
```

Results for validation set on GPT-5.4 high effort
```
=================================================================
  rollouts/structured_outputs/gpt-5.4-2026-03-05/rollouts.jsonl
=================================================================

  OVERALL
    n = 357
    pass: 248/357 (69.5%)
    mean reward: 0.6947

-----------------------------------------------------------------

  schema_type=json
    n = 119
    pass: 89/119 (74.8%)
    mean reward: 0.7479

  schema_type=xml
    n = 119
    pass: 79/119 (66.4%)
    mean reward: 0.6639

  schema_type=yaml
    n = 119
    pass: 80/119 (67.2%)
    mean reward: 0.6723

=================================================================
```

---------

Signed-off-by: Junkeun Yi <jkyi@nvidia.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Replacement PR for: #797
Contributors: @cmunley1 

generic langgraph adapter

reflection, rewoo, orchestrator, parallel thinking example

orchestrator, math: 
<img width="630" height="212" alt="image"
src="https://github.com/user-attachments/assets/ec61bc2d-97f6-4ec6-b78d-fa5dace9cd43"
/>

---------

Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Signed-off-by: Christian Munley <cmunley@>
Co-authored-by: cmunley1 <cmunley@nvidia.com>
Three mathematical proof environments:
* Proof Judge: Evaluating a model-generated proof correctness
* Proof Validation: Evaluating a model's judgement of a proof corectness
* Proof Genselect: Picking correct proof amonth two given proofs.

---------

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
1. Refactor benchmark config yaml structure to support running multiple
benchmarks at a time
2. Support benchmark running through ng_e2e_collect_rollouts
3. Miscellaneous improvements to MCQA
4. Miscellaneous cleanup on old benchmark infra
5. Rename `swap_key` directive to `inherit_from` per suggestion from
@gwarmstrong
6. Add docs about suggested benchmark running

---------

Signed-off-by: Brian Yu <bxyu@nvidia.com>
…PYTHONPATH (#920)

## Fix: use venv Python for swerl_gen Ray workers

Replace the hardcoded
`PYTHONPATH="/opt/nemo-rl/3rdparty/Gym-workspace/Gym"` in
`singularity_utils.py` with `py_executable: sys.executable`.

### Problem

The `compute_score` Ray remote task hardcodes an internal deployment
path in its `runtime_env`. This breaks any setup where Gym is installed
at a different location.

### Fix

Set `py_executable: sys.executable` so Ray workers use the same venv
Python as the server process. Since `nemo_gym` is installed as an
editable package and `resources_servers` is included in
`pyproject.toml`'s package list, all imports resolve correctly from the
venv — no `PYTHONPATH` override needed.

This is the same pattern applied to `compute_code_generation_metrics.py`
in #908.

Signed-off-by: spacegoing <spacegoing@gmail.com>
…te fetch (#919)

`nltk.download()` always fetches the remote package index even when the
data is already present. Guard with a local `find()` first to skip the
download when the data already exists.

Signed-off-by: spacegoing <spacegoing@gmail.com>
`check_correctness_remote` was decorated with `@ray.remote` without
`runtime_env`, so Ray workers spawned on system Python could not import
  `lcb_integration`. The workaround was a symlink from the server's venv
  into `/usr/local/lib/python3.12/dist-packages/`.

  Fix: add `runtime_env={"py_executable": sys.executable}` to the
  decorator, consistent with the pattern already used in `swe_agents`,
  `mini_swe_agent`, and `harbor_agent`.

  This ensures Ray workers inherit the code_gen server's venv and can
  import `lcb_integration` directly, with no system path side effects.

---------

Signed-off-by: spacegoing <spacegoing@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replacement PR for: #722
Contributors: @lbliii

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
…equent model calls (#939)

Signed-off-by: Brian Yu <bxyu@nvidia.com>
Thanks to @gwarmstrong

---------

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Align with upstream/main to avoid ImportError when pydot is not installed
at module load time.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Mahan Fathi <mahan0fathi@gmail.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Mahan Fathi <mahan0fathi@gmail.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Merge main into branch, resolving conflicts in README.md,
nemo_gym/rollout_collection.py, and reasoning_gym config.
Reverted non-turing_vif changes (reasoning_gym.yaml, env.example.yaml)
to match main.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
@MahanFathi MahanFathi requested a review from bxyu-nvidia March 24, 2026 18:27
@MahanFathi MahanFathi added the resources-server Resources servers (math, code, etc.) label Mar 24, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The config file is not committed to this branch, so the CI
pre-commit hook correctly removes its row from the table.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Empty scores list means no constraints to violate, so the reward
should be 1.0 (vacuous truth), not 0.0.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
… fields

The metrics format changed upstream (Median fields were removed),
causing data validation to fail on the outdated cached metrics.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Generated by running ng_collect_rollouts against the 5 example
entries using gcp/google/gemini-3-flash-preview via the NVIDIA
inference API. Required for CI data validation step.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>

async def run_from_config(self, config: RolloutCollectionConfig) -> Tuple[List[Dict]]:
output_fpath = Path(config.output_jsonl_fpath)
# turing_vif change: sidecar file that collects rollouts excluded from the main output
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if this can live in the resources server / agent instead of core

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a little clueless honestly (this is not originally my code). do you think you can apply the changes?

Copy link
Copy Markdown
Contributor

@bxyu-nvidia bxyu-nvidia Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should refactor the resources server in this PR so samples do not fail i.e. we expect code quality to be good enough that we have no issue running

Copy link
Copy Markdown
Contributor

@cmunley1 cmunley1 Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking we just remove this skip logic, and let reward be 0 if the task failed. The example on line 246 of bad task data or unsupported language/instruction: if reward 0, this is something reward profiling can catch, and filter that way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmunley1 is this branch a fix to the above? cmunley1/turing_envs_fix


The server automatically detects the model type and uses the appropriate API.

## Dataset Format
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the dataset for this environment? Is it released publicly? Can you please provide instructions to download and prepare the dataset if it is public, or make it clear if it is not

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this env targets 4 datasets: mulltichallenge, inverse_ifeval, CF-Bench and SysBench. my understanding is that they're all vendor collected and they're not all public yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

resources-server Resources servers (math, code, etc.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.