Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets)#951

Open

MahanFathi wants to merge 112 commits intomainfrom

mfathi/turing_envs_v2

Contributor

MahanFathi commented Mar 24, 2026 •

edited

Loading

Important

This PR is a successor to #801, branch feature/nvidia-IF-bench-validators-integrations. The main changes were already there, this branch includes changes we needed to successfully test the environments. Let's merge this PR instead of #801.

Important note to @bxyu-nvidia: rollout_collection.py was changed in the abovelinked PR, which requires your careful review.

Summary of my changes

Dynamic judge URL discovery — Added judge_server_name config field so the judge URL is resolved at runtime from the NeMo-Gym server registry instead of being hardcoded. Enables use of local_vllm_model (which actually spins up vLLM via Ray) as the judge server type.
Lazy import fix in profiling.py — Moved gprof2dot/pydot imports inside dump() to avoid ModuleNotFoundError on Ray workers where profiling deps aren't installed.
Configurable reward aggregation — Replaced hard-coded all-or-nothing (all) reward with a configurable aggregation_mode supporting all, any, mean, min, and max. Default remains all (no behavior change).
Thinking trace stripping — Added helpers to skip type="reasoning" output items and strip <think>/<thinking> tags before evaluation, preventing chain-of-thought from contaminating validator checks and judge prompts.
Judge prompt restructuring — Reordered LLM_JUDGE_QUESTION_PROMPT to present conversation context before the model response, and replaced fragile JSON output format with robust [[YES]]/[[NO]] bracket markers plus multi-tier fallback extraction. Eliminates silent false negatives from JSON parse failures.
Configurable judge sampling parameters — Exposed judge_temperature (default 0.7), judge_top_p (default 0.8), and judge_max_tokens (default 10000) as config fields, replacing previously hardcoded values.
Config & docs updates — Base turing_vif.yaml updated with new fields; README documented aggregation_mode.

All changes are backwards-compatible.

Checklist

Ran successful experiments with Multichallenge and InverseIFEval datasets on Nano-v3
@abukharin-nv is wrapping up his experiments using the same env on CFBench and SysBench
Gitlab issues for training datasets have been created (to my knowledge MC and IIFEval are approved by legal)

dhrutisundar-turing and others added 30 commits

January 7, 2026 14:52


          integrate VIF validators and add test jsonl files

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          Add env.example

2160f2f

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          Update How to start.md

adb12b5

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          Flagging validation issues and writing them into error.json

8ae2d61

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Added comments

488415d

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Merge pull request #2 from dhrutisundar-turing/validation-flagging

dac8825

Flagging validation issues and writing them into error.json

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          ADD pass criteria support

ab68bb6

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          ADD multi lang support

e473520

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          Merge pull request #3 from dhrutisundar-turing/IFTL-218-multi-lang-su…

b847bfe

…pport

[IFTL-218] Multi-Lang Support

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>


          separated the validators into language folders

b7ae788

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Code cleaning

e360b95

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          verified with all tests in the official guide

42f537c

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Removing unsupported instruction from language validators.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          fixed ascii display for non-english language

7f3f786

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Addinf SPDX header to all python files

9a870f1

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          code cleanup

0def99f

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Merge branch 'fixes/lang_validator' of https://github.com/dhrutisunda…

5062eca

…r-turing/Nvidia-gym-turing into fixes/lang_validator

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          cleaned up licence headers

19ad4d5

Signed-off-by: qasimo-debug <qasim.o@turing.com>


          Merge pull request #5 from dhrutisundar-turing/fixes/lang_validator

b5ff81a

Fixes/lang validator

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Merge branch 'main' into feature/nvidia-IF-bench-validators-integrations

a4dc771

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          Rollout collection tutorial fixes (#790)

41e8e51

Remove the "Where Do Reward Scores Come From?" note that implied custom
verification logic is optional. Also fix tutorial goals to match actual
content and correct the resource server name.

Fixes #776

Signed-off-by: Chris Wing <cwing@nvidia.com>


          docs: align tutorial time (#791)

644c3b4

change tutorial card est time from 45-90 to 30 mins as in the tutorial
itself

#780

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>


          docs: Move environment best practices from contributing to environmen…

7bda8c4

…t tutorials section (#785)

Signed-off-by: Brian Yu <bxyu@nvidia.com>

Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>


          fix: typos in verifiers agent readme (#755)

c0ebfa7

5927179

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>


          Merge branch 'main' into feature/nvidia-IF-bench-validators-integrations

33d8dd3

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          fix: resolve rollout_collection.py test failures and missing imports

04fc646

- Extract _preprocess_rows_from_config from duplicate run_from_config
- Add missing imports: json, deepcopy, Union, Literal
- Add return results to run_from_config
- Remove dead _post_coroutine block (undefined server_client reference)

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>


          style: apply ruff format and update pre-commit hooks

b99f1ac

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor


          Merge branch 'main' of NVIDIA-NeMo/Gym into feature/nvidia-IF-bench-v…

274e388

…alidators-integrations


          chore: enable coverage for RolloutCollectionHelper and remove stale a…

20f4cd0

…rtifacts

- Remove # pragma: no cover from RolloutCollectionHelper class
- Drop stale how_to_start.md entry from .gitignore
- Delete tracked example_rollouts.jsonl generated artifact

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor


          chore: add turing_vif to resource server table in README

9fa969f

Auto-generated by update-readme-table pre-commit hook.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor

gwarmstrong and others added 19 commits

March 17, 2026 13:49


          GPQA benchmark (#903)

652608a

Add GPQA benchmark, per_sample_aggregate, and refined metrics

* Add benchmarks/gpqa/ (config, prepare.py, prompt, README)
* Extract common utilities for metrics
* Add grading mode override to mcqa environment

---------

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>


          Structured Outputs update with YAML and XML (#865)

94acfff

Adds YAML and XML parser logic for the structured outputs verifier.

Included addition of new dataset with JSON, YAML, and XML data.
config found in 
```
resources_servers/structured_outputs/configs/structured_outputs_json_yaml_xml_v1.yaml
```

Results for validation set on GPT-5.4 high effort
```
=================================================================
  rollouts/structured_outputs/gpt-5.4-2026-03-05/rollouts.jsonl
=================================================================

  OVERALL
    n = 357
    pass: 248/357 (69.5%)
    mean reward: 0.6947

-----------------------------------------------------------------

  schema_type=json
    n = 119
    pass: 89/119 (74.8%)
    mean reward: 0.7479

  schema_type=xml
    n = 119
    pass: 79/119 (66.4%)
    mean reward: 0.6639

  schema_type=yaml
    n = 119
    pass: 80/119 (67.2%)
    mean reward: 0.6723

=================================================================
```

---------

Signed-off-by: Junkeun Yi <jkyi@nvidia.com>

da18292

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>


          feat: langgraph integration (#877)

55d71c3

Replacement PR for: #797
Contributors: @cmunley1 

generic langgraph adapter

reflection, rewoo, orchestrator, parallel thinking example

orchestrator, math: 
<img width="630" height="212" alt="image"
src="https://github.com/user-attachments/assets/ec61bc2d-97f6-4ec6-b78d-fa5dace9cd43"
/>

---------

Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Signed-off-by: Christian Munley <cmunley@>
Co-authored-by: cmunley1 <cmunley@nvidia.com>


          Add proof environments (#907)

1d93a89

Three mathematical proof environments:
* Proof Judge: Evaluating a model-generated proof correctness
* Proof Validation: Evaluating a model's judgement of a proof corectness
* Proof Genselect: Picking correct proof amonth two given proofs.

---------

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>


          feat: Benchmark infra refactors (#906)

d581116

1. Refactor benchmark config yaml structure to support running multiple
benchmarks at a time
2. Support benchmark running through ng_e2e_collect_rollouts
3. Miscellaneous improvements to MCQA
4. Miscellaneous cleanup on old benchmark infra
5. Rename `swap_key` directive to `inherit_from` per suggestion from
@gwarmstrong
6. Add docs about suggested benchmark running

---------

Signed-off-by: Brian Yu <bxyu@nvidia.com>


          [Fix] use venv Python for swerl_gen Ray workers instead of hardcoded …

cb319a2

…PYTHONPATH (#920)

## Fix: use venv Python for swerl_gen Ray workers

Replace the hardcoded
`PYTHONPATH="/opt/nemo-rl/3rdparty/Gym-workspace/Gym"` in
`singularity_utils.py` with `py_executable: sys.executable`.

### Problem

The `compute_score` Ray remote task hardcodes an internal deployment
path in its `runtime_env`. This breaks any setup where Gym is installed
at a different location.

### Fix

Set `py_executable: sys.executable` so Ray workers use the same venv
Python as the server process. Since `nemo_gym` is installed as an
editable package and `resources_servers` is included in
`pyproject.toml`'s package list, all imports resolve correctly from the
venv — no `PYTHONPATH` override needed.

This is the same pattern applied to `compute_code_generation_metrics.py`
in #908.

Signed-off-by: spacegoing <spacegoing@gmail.com>


          [Fix] guard nltk download with local find() to avoid unnecessary remo…

bb2c233

…te fetch (#919)

`nltk.download()` always fetches the remote package index even when the
data is already present. Guard with a local `find()` first to skip the
download when the data already exists.

Signed-off-by: spacegoing <spacegoing@gmail.com>


          [fix] (code_gen): use runtime_env py_executable for Ray workers (#913)

f4aa2f4

`check_correctness_remote` was decorated with `@ray.remote` without
`runtime_env`, so Ray workers spawned on system Python could not import
  `lcb_integration`. The workaround was a symlink from the server's venv
  into `/usr/local/lib/python3.12/dist-packages/`.

  Fix: add `runtime_env={"py_executable": sys.executable}` to the
  decorator, consistent with the pattern already used in `swe_agents`,
  `mini_swe_agent`, and `harbor_agent`.

  This ensures Ray workers inherit the code_gen server's venv and can
  import `lcb_integration` directly, with no system path side effects.

---------

Signed-off-by: spacegoing <spacegoing@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


          docs: version bump, CTA link changes (#880)

61e22ae

Replacement PR for: #722
Contributors: @lbliii

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Lawrence Lane <llane@nvidia.com>


          Add zero reward group option for proof judge environment (#923)

ad9cc6b

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>


          fix: always send session cookie for starlette >= 1.0.0 (#942)

157f24f

Signed-off-by: cmunley1 <cmunley@nvidia.com>


          feat: Fix duplicated usage counting and errors on empty usage in subs…

8e58d01

…equent model calls (#939)

Signed-off-by: Brian Yu <bxyu@nvidia.com>


          benchmark: LiveCodeBench v5 and v6 (#933)

343fda5

Thanks to @gwarmstrong

---------

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

5ab03df

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>


          fix: use lazy imports for gprof2dot and pydot in profiling module

47b2f95

Align with upstream/main to avoid ImportError when pydot is not installed
at module load time.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Mahan Fathi <mahan0fathi@gmail.com>


          style: fix ruff-format line length in turing_vif app

97feaae

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Mahan Fathi <mahan0fathi@gmail.com>


          docs: update resources server table in README

4e25fe0

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>


          merge: resolve conflicts with origin/main

efe80f4

Merge main into branch, resolving conflicts in README.md,
nemo_gym/rollout_collection.py, and reasoning_gym config.
Reverted non-turing_vif changes (reasoning_gym.yaml, env.example.yaml)
to match main.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>

MahanFathi requested a review from bxyu-nvidia

March 24, 2026 18:27

MahanFathi assigned bxyu-nvidia

MahanFathi added the resources-server label

copy-pr-bot bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

MahanFathi added 4 commits

March 24, 2026 11:29


          fix: remove untracked equivalence_llm_judge_no_swap entry from README

84607e1

The config file is not committed to this branch, so the CI
pre-commit hook correctly removes its row from the table.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>


          fix(turing_vif): return reward 1.0 when no instructions are present

7daf423

Empty scores list means no constraints to violate, so the reward
should be 1.0 (vacuous truth), not 0.0.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>


          fix(turing_vif): regenerate example_metrics.json without stale Median…

18c4c2e

… fields

The metrics format changed upstream (Median fields were removed),
causing data validation to fail on the outdated cached metrics.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>


          Add example rollouts for turing_vif resources server

d93302f

Generated by running ng_collect_rollouts against the 5 example
entries using gcp/google/gemini-3-flash-preview via the NVIDIA
inference API. Required for CI data validation step.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>

cmunley1 reviewed

View reviewed changes

nemo_gym/rollout_collection.py

                   async def run_from_config(self, config: RolloutCollectionConfig) -> Tuple[List[Dict]]:
                       output_fpath = Path(config.output_jsonl_fpath)
+                      # turing_vif change: sidecar file that collects rollouts excluded from the main output

Contributor

cmunley1 Mar 30, 2026

i wonder if this can live in the resources server / agent instead of core

Contributor Author

MahanFathi Mar 31, 2026

i'm a little clueless honestly (this is not originally my code). do you think you can apply the changes?

Contributor

bxyu-nvidia Apr 16, 2026 •

edited

Loading

We should refactor the resources server in this PR so samples do not fail i.e. we expect code quality to be good enough that we have no issue running

Contributor

cmunley1 Apr 17, 2026 •

edited

Loading

I am thinking we just remove this skip logic, and let reward be 0 if the task failed. The example on line 246 of bad task data or unsupported language/instruction: if reward 0, this is something reward profiling can catch, and filter that way.

Contributor Author

MahanFathi Apr 17, 2026

@cmunley1 is this branch a fix to the above? cmunley1/turing_envs_fix

cmunley1 mentioned this pull request

Feature/nvidia if bench validators integrations #801

Closed

cmunley1 requested changes

View reviewed changes

resources_servers/turing_vif/README.md


		The server automatically detects the model type and uses the appropriate API.

		## Dataset Format

Contributor

cmunley1 Apr 17, 2026

Where is the dataset for this environment? Is it released publicly? Can you please provide instructions to download and prepare the dataset if it is public, or make it clear if it is not

Contributor Author

MahanFathi Apr 17, 2026

this env targets 4 datasets: mulltichallenge, inverse_ifeval, CF-Bench and SysBench. my understanding is that they're all vendor collected and they're not all public yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

resources-server