Turing Envs (Covers Multichallenge, InverseIFEval, CFBench and SysBench datasets) #951
MahanFathi wants to merge 112 commits into main from
Conversation
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Flagging validation issues and writing them into error.json Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
…pport [IFTL-218] Multi-Lang Support Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…r-turing/Nvidia-gym-turing into fixes/lang_validator Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Fixes/lang validator Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Remove the "Where Do Reward Scores Come From?" note that implied custom verification logic is optional. Also fix tutorial goals to match actual content and correct the resource server name. Fixes #776 Signed-off-by: Chris Wing <cwing@nvidia.com>
change tutorial card est time from 45-90 to 30 mins as in the tutorial itself #780 Signed-off-by: cmunley1 <cmunley@nvidia.com> Signed-off-by: Christian Munley <cmunley@nvidia.com>
…t tutorials section (#785) Signed-off-by: Brian Yu <bxyu@nvidia.com> Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
5927179 Signed-off-by: cmunley1 <cmunley@nvidia.com> Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
- Extract _preprocess_rows_from_config from duplicate run_from_config
- Add missing imports: json, deepcopy, Union, Literal
- Add return results to run_from_config
- Remove dead _post_coroutine block (undefined server_client reference)
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
…alidators-integrations
…rtifacts
- Remove # pragma: no cover from RolloutCollectionHelper class
- Drop stale how_to_start.md entry from .gitignore
- Delete tracked example_rollouts.jsonl generated artifact
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
Auto-generated by update-readme-table pre-commit hook. Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com> Made-with: Cursor
Add GPQA benchmark, per_sample_aggregate, and refined metrics
* Add benchmarks/gpqa/ (config, prepare.py, prompt, README)
* Extract common utilities for metrics
* Add grading mode override to mcqa environment
---------
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com> Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Adds YAML and XML parser logic for the structured outputs verifier.
Also adds a new dataset with JSON, YAML, and XML data.
Config found in
```
resources_servers/structured_outputs/configs/structured_outputs_json_yaml_xml_v1.yaml
```
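The commit above describes format-level checks for JSON, YAML, and XML outputs. As a rough sketch of what such a check could look like (the function name `parses_as` is hypothetical, not this verifier's real API; a YAML branch would additionally need the third-party PyYAML package's `yaml.safe_load`):

```python
import json
import xml.etree.ElementTree as ET

def parses_as(text: str, schema_type: str) -> bool:
    """Return True if `text` is well-formed for the given schema type.

    Illustrative only: the real verifier likely also validates against a
    schema, not just well-formedness.
    """
    try:
        if schema_type == "json":
            json.loads(text)
        elif schema_type == "xml":
            ET.fromstring(text)
        else:
            raise ValueError(f"unsupported schema_type: {schema_type}")
        return True
    except (json.JSONDecodeError, ET.ParseError):
        return False
```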
Results for the validation set on GPT-5.4, high effort:
```
=================================================================
rollouts/structured_outputs/gpt-5.4-2026-03-05/rollouts.jsonl
=================================================================
OVERALL
n = 357
pass: 248/357 (69.5%)
mean reward: 0.6947
-----------------------------------------------------------------
schema_type=json
n = 119
pass: 89/119 (74.8%)
mean reward: 0.7479
schema_type=xml
n = 119
pass: 79/119 (66.4%)
mean reward: 0.6639
schema_type=yaml
n = 119
pass: 80/119 (67.2%)
mean reward: 0.6723
=================================================================
```
---------
Signed-off-by: Junkeun Yi <jkyi@nvidia.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Replacement PR for: #797. Contributors: @cmunley1.
- generic langgraph adapter
- reflection, rewoo, orchestrator, parallel thinking examples
orchestrator, math: <img width="630" height="212" alt="image" src="https://github.com/user-attachments/assets/ec61bc2d-97f6-4ec6-b78d-fa5dace9cd43" />
---------
Signed-off-by: Christian Munley <cmunley@nvidia.com> Signed-off-by: cmunley1 <cmunley@nvidia.com> Signed-off-by: Christian Munley <cmunley@> Co-authored-by: cmunley1 <cmunley@nvidia.com>
Three mathematical proof environments:
* Proof Judge: evaluating the correctness of a model-generated proof
* Proof Validation: evaluating a model's judgment of a proof's correctness
* Proof Genselect: picking the correct proof among two given proofs
---------
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
1. Refactor benchmark config yaml structure to support running multiple benchmarks at a time
2. Support benchmark running through ng_e2e_collect_rollouts
3. Miscellaneous improvements to MCQA
4. Miscellaneous cleanup on old benchmark infra
5. Rename `swap_key` directive to `inherit_from` per suggestion from @gwarmstrong
6. Add docs about suggested benchmark running
---------
Signed-off-by: Brian Yu <bxyu@nvidia.com>
…PYTHONPATH (#920)
## Fix: use venv Python for swerl_gen Ray workers
Replace the hardcoded `PYTHONPATH="/opt/nemo-rl/3rdparty/Gym-workspace/Gym"` in `singularity_utils.py` with `py_executable: sys.executable`.
### Problem
The `compute_score` Ray remote task hardcodes an internal deployment path in its `runtime_env`. This breaks any setup where Gym is installed at a different location.
### Fix
Set `py_executable: sys.executable` so Ray workers use the same venv Python as the server process. Since `nemo_gym` is installed as an editable package and `resources_servers` is included in `pyproject.toml`'s package list, all imports resolve correctly from the venv; no `PYTHONPATH` override is needed.
This is the same pattern applied to `compute_code_generation_metrics.py` in #908.
Signed-off-by: spacegoing <spacegoing@gmail.com>
…te fetch (#919) `nltk.download()` always fetches the remote package index even when the data is already present. Guard with a local `find()` first to skip the download when the data already exists. Signed-off-by: spacegoing <spacegoing@gmail.com>
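The guard described above is a general check-before-fetch pattern: probe the local cache first, hit the network only on a miss. A minimal generic sketch (the names `ensure_resource`, `cache`, and `fetch` are illustrative stand-ins; the real code guards `nltk.download()` with `nltk.data.find()`):

```python
def ensure_resource(name, cache, fetch):
    """Fetch `name` only if it is not already in the local cache.

    Returns True if a fetch happened, False if the cached copy was reused.
    """
    if name in cache:
        return False  # already present locally, skip the network round-trip
    cache[name] = fetch(name)
    return True
```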
`check_correctness_remote` was decorated with `@ray.remote` without
`runtime_env`, so Ray workers spawned on system Python could not import
`lcb_integration`. The workaround was a symlink from the server's venv
into `/usr/local/lib/python3.12/dist-packages/`.
Fix: add `runtime_env={"py_executable": sys.executable}` to the
decorator, consistent with the pattern already used in `swe_agents`,
`mini_swe_agent`, and `harbor_agent`.
This ensures Ray workers inherit the code_gen server's venv and can
import `lcb_integration` directly, with no system path side effects.
---------
Signed-off-by: spacegoing <spacegoing@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
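The `py_executable` pattern from the two fixes above can be sketched as follows. This is a hedged illustration, not the repo's actual code; with Ray installed, the dict would be passed as `@ray.remote(runtime_env=...)` so workers reuse the server's own virtualenv:

```python
import sys

def make_runtime_env() -> dict:
    """Build a Ray runtime_env that pins workers to this process's Python.

    Workers launched with this runtime_env inherit the current venv, so
    editable-installed packages resolve without PYTHONPATH overrides.
    """
    return {"py_executable": sys.executable}

# With Ray available, usage would look like (decorator shown as a comment
# since Ray is not assumed here):
# @ray.remote(runtime_env=make_runtime_env())
# def check_correctness_remote(...): ...
```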
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
…equent model calls (#939) Signed-off-by: Brian Yu <bxyu@nvidia.com>
Thanks to @gwarmstrong --------- Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com> Signed-off-by: Brian Yu <bxyu@nvidia.com> Co-authored-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Align with upstream/main to avoid ImportError when pydot is not installed at module load time. Signed-off-by: Mahan Fathi <mfathi@nvidia.com> Signed-off-by: Mahan Fathi <mahan0fathi@gmail.com>
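The fix above defers an optional dependency to call time so module import never fails when it is absent. A sketch of that lazy-import pattern (`render_graph` and its body are hypothetical names, not the repo's code):

```python
def render_graph(dot_source):
    """Parse DOT source with pydot if it is installed, else skip gracefully.

    Importing pydot inside the function means this module loads fine on
    workers where the profiling extras are not installed.
    """
    try:
        import pydot  # imported here, not at module load time
    except ImportError:
        return None  # optional dependency missing: degrade, don't crash
    return pydot.graph_from_dot_data(dot_source)
```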
Signed-off-by: Mahan Fathi <mfathi@nvidia.com> Signed-off-by: Mahan Fathi <mahan0fathi@gmail.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Merge main into branch, resolving conflicts in README.md, nemo_gym/rollout_collection.py, and reasoning_gym config. Reverted non-turing_vif changes (reasoning_gym.yaml, env.example.yaml) to match main. Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
The config file is not committed to this branch, so the CI pre-commit hook correctly removes its row from the table. Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Empty scores list means no constraints to violate, so the reward should be 1.0 (vacuous truth), not 0.0. Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
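The vacuous-truth behavior described in that commit falls out of Python's `all()`, which returns True on an empty sequence. A minimal sketch (the function name and boolean input shape are assumptions, not the env's real API):

```python
def aggregate_all(passes):
    # passes: per-constraint pass/fail flags (assumed shape).
    # all([]) is True, so zero constraints yields reward 1.0, not 0.0.
    return 1.0 if all(passes) else 0.0
```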
… fields. The metrics format changed upstream (Median fields were removed), causing data validation to fail on the outdated cached metrics. Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Generated by running ng_collect_rollouts against the 5 example entries using gcp/google/gemini-3-flash-preview via the NVIDIA inference API. Required for CI data validation step. Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
```
async def run_from_config(self, config: RolloutCollectionConfig) -> Tuple[List[Dict]]:
    output_fpath = Path(config.output_jsonl_fpath)
    # turing_vif change: sidecar file that collects rollouts excluded from the main output
```
i wonder if this can live in the resources server / agent instead of core
i'm a little clueless honestly (this is not originally my code). do you think you can apply the changes?
We should refactor the resources server in this PR so samples do not fail, i.e. we expect code quality to be good enough that we have no issues running.
I am thinking we just remove this skip logic and let the reward be 0 if the task failed. The example on line 246 is bad task data or an unsupported language/instruction: if the reward is 0, reward profiling can catch this and filter it out that way.
@cmunley1 is this branch a fix to the above? cmunley1/turing_envs_fix
```
The server automatically detects the model type and uses the appropriate API.

## Dataset Format
```
Where is the dataset for this environment? Is it released publicly? Can you please provide instructions to download and prepare the dataset if it is public, or make it clear if it is not.
this env targets 4 datasets: multichallenge, inverse_ifeval, CF-Bench and SysBench. my understanding is that they're all vendor-collected and they're not all public yet.
Important
This PR is a successor to #801, branch `feature/nvidia-IF-bench-validators-integrations`. The main changes were already there; this branch includes changes we needed to successfully test the environments. Let's merge this PR instead of #801.

Important note to @bxyu-nvidia: `rollout_collection.py` was changed in the above-linked PR, which requires your careful review.

Summary of my changes
- Added a `judge_server_name` config field so the judge URL is resolved at runtime from the NeMo-Gym server registry instead of being hardcoded. Enables use of `local_vllm_model` (which actually spins up vLLM via Ray) as the judge server type.
- `profiling.py`: moved `gprof2dot`/`pydot` imports inside `dump()` to avoid `ModuleNotFoundError` on Ray workers where profiling deps aren't installed.
- Replaced the all-or-nothing (`all`) reward with a configurable `aggregation_mode` supporting `all`, `any`, `mean`, `min`, and `max`. Default remains `all` (no behavior change).
- Handle `type="reasoning"` output items and strip `<think>`/`<thinking>` tags before evaluation, preventing chain-of-thought from contaminating validator checks and judge prompts.
- Restructured `LLM_JUDGE_QUESTION_PROMPT` to present conversation context before the model response, and replaced the fragile JSON output format with robust `[[YES]]`/`[[NO]]` bracket markers plus multi-tier fallback extraction. Eliminates silent false negatives from JSON parse failures.
- Exposed `judge_temperature` (default 0.7), `judge_top_p` (default 0.8), and `judge_max_tokens` (default 10000) as config fields, replacing previously hardcoded values.
- `turing_vif.yaml` updated with new fields; README documents `aggregation_mode`.

All changes are backwards-compatible.
Checklist