Conversation
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Walkthrough
This PR adds support for the Qwen3-VL-2B vision-language model to the NemoGym framework by introducing a new configuration file, extending the policy model registry to recognize the model, and adding an `is_vlm` policy flag.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (1)
examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml (1)
56-58: Consider using the full context window for `max_new_tokens`.

Most GRPO configs in this repo set `max_new_tokens` to `policy.max_total_sequence_length` and rely on the existing overflow safeguards. Hard-coding `1024` makes this example behave differently unless that shorter cap is intentional.

Suggested change:

```diff
- max_new_tokens: 1024
+ max_new_tokens: ${policy.max_total_sequence_length}
```

Based on learnings: "In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml` around lines 56 - 58, The config hard-codes generation.max_new_tokens to 1024; change it to use the policy's full context window by setting generation.max_new_tokens to policy.max_total_sequence_length so the example matches other GRPO/distillation configs and relies on the overflow safeguards implemented in vllm_worker.py for prompt+generation length handling.
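For illustration, the suggested fix could look like this in the example YAML. The `max_total_sequence_length` value here is an assumed placeholder; only the interpolation on `max_new_tokens` is the point:

```yaml
policy:
  max_total_sequence_length: 4096   # assumed illustrative value
  generation:
    max_new_tokens: ${policy.max_total_sequence_length}  # track the full context window
```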
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/nemo_gym/run_grpo_nemo_gym.py`:
- Around line 166-169: Replace the fragile assert with an explicit exception: in
the block that checks is_vlm and "vllm_cfg" in config["policy"]["generation"],
validate that config["policy"]["generation"]["vllm_cfg"]["skip_tokenizer_init"]
is False and if not raise a ValueError with a clear message (e.g. "VLMs require
skip_tokenizer_init=False") instead of using assert; update the check around the
is_vlm / vllm_cfg logic to raise the ValueError so the configuration error
cannot be skipped under python -O.
- Around line 151-157: The code hides the default for policy.is_vlm by using
config["policy"].get("is_vlm", False); change the call site to read the value
directly (is_vlm = config["policy"]["is_vlm"]) and keep the downstream logic
that selects get_tokenizer/get_processor (processor, tokenizer) the same, and
ensure the YAML/config (policy.is_vlm) contains an explicit default so the key
is always present rather than relying on a code-side default.
In `@nemo_rl/models/policy/__init__.py`:
- Line 295: Add inline documentation for the new TypedDict key
PolicyConfig.is_vlm: annotate the line where is_vlm: NotRequired[bool] with a
short comment describing its purpose (what enabling "vlm" does), the valid type
(bool), the recommended default (e.g., False) and expected behavior when
omitted, and keep the NotRequired typing as-is; then update the canonical
example config files under examples/configs/*.yaml to include the key set to the
documented default so examples reflect the intended default behavior.
---
Nitpick comments:
In `@examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml`:
- Around line 56-58: The config hard-codes generation.max_new_tokens to 1024;
change it to use the policy's full context window by setting
generation.max_new_tokens to policy.max_total_sequence_length so the example
matches other GRPO/distillation configs and relies on the overflow safeguards
implemented in vllm_worker.py for prompt+generation length handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6ef38dad-bc0d-4454-b745-835029dfff2b
📒 Files selected for processing (4)
examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml
examples/nemo_gym/run_grpo_nemo_gym.py
nemo_rl/models/policy/__init__.py
nemo_rl/models/policy/utils.py
```python
is_vlm = config["policy"].get("is_vlm", False)
if is_vlm:
    processor = get_tokenizer(config["policy"]["tokenizer"], get_processor=True)
    tokenizer = processor.tokenizer
else:
    processor = None
    tokenizer = get_tokenizer(config["policy"]["tokenizer"])
```
Don't hide policy.is_vlm behind a code-side default.
config["policy"].get("is_vlm", False) makes a missing key indistinguishable from an explicit false, and it moves the default out of YAML. Set the default in config and read it directly here.
Suggested change:

```diff
- is_vlm = config["policy"].get("is_vlm", False)
+ is_vlm = config["policy"]["is_vlm"]
```

As per coding guidelines, "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values" and "Access required config values directly (e.g., policy_cfg['precision']) and assume they are present; do not introduce hidden defaults in code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/nemo_gym/run_grpo_nemo_gym.py` around lines 151 - 157, The code
hides the default for policy.is_vlm by using config["policy"].get("is_vlm",
False); change the call site to read the value directly (is_vlm =
config["policy"]["is_vlm"]) and keep the downstream logic that selects
get_tokenizer/get_processor (processor, tokenizer) the same, and ensure the
YAML/config (policy.is_vlm) contains an explicit default so the key is always
present rather than relying on a code-side default.
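The point of the guideline can be seen in a small sketch, with plain dicts standing in for the loaded config: direct access surfaces a missing key immediately instead of silently defaulting.

```python
# Config loaded from YAML that declares the default explicitly.
config = {"policy": {"is_vlm": False}}
print(config["policy"]["is_vlm"])  # explicit False, owned by the YAML

# Config whose YAML forgot the key: direct access fails fast.
broken = {"policy": {}}
try:
    broken["policy"]["is_vlm"]
except KeyError:
    print("missing policy.is_vlm raises KeyError instead of hiding behind a code default")
```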
```python
        | SchedulerMilestones
        | None
    ]
    is_vlm: NotRequired[bool]
```
Document the new PolicyConfig.is_vlm key.
This adds a public config knob, but the field still has no inline documentation for its purpose, valid values, or expected default. Please document it here and make sure a canonical example config reflects that default.
As per coding guidelines, "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/models/policy/__init__.py` at line 295, Add inline documentation for
the new TypedDict key PolicyConfig.is_vlm: annotate the line where is_vlm:
NotRequired[bool] with a short comment describing its purpose (what enabling
"vlm" does), the valid type (bool), the recommended default (e.g., False) and
expected behavior when omitted, and keep the NotRequired typing as-is; then
update the canonical example config files under examples/configs/*.yaml to
include the key set to the documented default so examples reflect the intended
default behavior.
terrykong
left a comment
awesome! @yfw @aroshanghias-nvd can you review?
cc @ksapra @DanialTaheri if you have any comments or questions
```python
# NeMo-Gym environment needs to get dp_openai_server_base_urls from policy_generation, so we don't setup env here.
print("\n▶ Setting up data...")
train_dataset, val_dataset = setup_response_data(
```
setup_response_data accepts is_vlm flag. Should we pass it here?
I believe since env_configs is None for NeMo Gym, is_vlm won't be used in setup_response_data, so passing it wouldn't change anything as is. A larger change could incorporate is_vlm here, but maybe in a separate PR?
aroshanghias-nvd
left a comment
Looks good to me. Just one minor comment added.
this is okay here if the goal is just to demo, but we ideally want every config we expect people to run to follow this practice
RL/tests/test_suites/README.md
Lines 50 to 52 in bc8aa39
```python
        | SchedulerMilestones
        | None
    ]
    is_vlm: NotRequired[bool]
```
i understand why this is added, but can we potentially re-use this arg to avoid having too many args doing the same thing
RL/nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
Lines 224 to 225 in bc8aa39
my preference is for that arg, b/c it doesn't shoehorn this one in, and b/c we will do ALM/OMNI in the future
What does this PR do ?
Adds VLM support to the NeMo Gym integration. The training run shown uses the circle click environment in NeMo Gym: https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/circle_click
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit