Conversation
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Walkthrough
This PR adds support for the Qwen3-VL-2B vision-language model to the NemoGym framework by introducing a new configuration file, extending the policy model registry to recognize the model, and adding an `is_vlm` policy flag.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (1)
examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml (1)
56-58: Consider using the full context window for `max_new_tokens`.

Most GRPO configs in this repo set `max_new_tokens` to `policy.max_total_sequence_length` and rely on the existing overflow safeguards. Hard-coding `1024` makes this example behave differently unless that shorter cap is intentional.

Suggested change:

```diff
- max_new_tokens: 1024
+ max_new_tokens: ${policy.max_total_sequence_length}
```

Based on learnings: "In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml` around lines 56 - 58, The config hard-codes generation.max_new_tokens to 1024; change it to use the policy's full context window by setting generation.max_new_tokens to policy.max_total_sequence_length so the example matches other GRPO/distillation configs and relies on the overflow safeguards implemented in vllm_worker.py for prompt+generation length handling.
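For illustration, the suggested fix could look like this in the example YAML. The `max_total_sequence_length` value here is an assumed placeholder; only the interpolation on `max_new_tokens` is the point:

```yaml
policy:
  max_total_sequence_length: 4096   # assumed illustrative value
  generation:
    max_new_tokens: ${policy.max_total_sequence_length}  # track the full context window
```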
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/nemo_gym/run_grpo_nemo_gym.py`:
- Around line 166-169: Replace the fragile assert with an explicit exception: in
the block that checks is_vlm and "vllm_cfg" in config["policy"]["generation"],
validate that config["policy"]["generation"]["vllm_cfg"]["skip_tokenizer_init"]
is False and if not raise a ValueError with a clear message (e.g. "VLMs require
skip_tokenizer_init=False") instead of using assert; update the check around the
is_vlm / vllm_cfg logic to raise the ValueError so the configuration error
cannot be skipped under python -O.
- Around line 151-157: The code hides the default for policy.is_vlm by using
config["policy"].get("is_vlm", False); change the call site to read the value
directly (is_vlm = config["policy"]["is_vlm"]) and keep the downstream logic
that selects get_tokenizer/get_processor (processor, tokenizer) the same, and
ensure the YAML/config (policy.is_vlm) contains an explicit default so the key
is always present rather than relying on a code-side default.
In `@nemo_rl/models/policy/__init__.py`:
- Line 295: Add inline documentation for the new TypedDict key
PolicyConfig.is_vlm: annotate the line where is_vlm: NotRequired[bool] with a
short comment describing its purpose (what enabling "vlm" does), the valid type
(bool), the recommended default (e.g., False) and expected behavior when
omitted, and keep the NotRequired typing as-is; then update the canonical
example config files under examples/configs/*.yaml to include the key set to the
documented default so examples reflect the intended default behavior.
---
Nitpick comments:
In `@examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml`:
- Around line 56-58: The config hard-codes generation.max_new_tokens to 1024;
change it to use the policy's full context window by setting
generation.max_new_tokens to policy.max_total_sequence_length so the example
matches other GRPO/distillation configs and relies on the overflow safeguards
implemented in vllm_worker.py for prompt+generation length handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6ef38dad-bc0d-4454-b745-835029dfff2b
📒 Files selected for processing (4)
examples/nemo_gym/grpo_circle_click_qwen3vl2b.yaml
examples/nemo_gym/run_grpo_nemo_gym.py
nemo_rl/models/policy/__init__.py
nemo_rl/models/policy/utils.py
```python
is_vlm = config["policy"].get("is_vlm", False)
if is_vlm:
    processor = get_tokenizer(config["policy"]["tokenizer"], get_processor=True)
    tokenizer = processor.tokenizer
else:
    processor = None
    tokenizer = get_tokenizer(config["policy"]["tokenizer"])
```
Don't hide policy.is_vlm behind a code-side default.
config["policy"].get("is_vlm", False) makes a missing key indistinguishable from an explicit false, and it moves the default out of YAML. Set the default in config and read it directly here.
Suggested change:

```diff
- is_vlm = config["policy"].get("is_vlm", False)
+ is_vlm = config["policy"]["is_vlm"]
```

As per coding guidelines, "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values" and "Access required config values directly (e.g., policy_cfg['precision']) and assume they are present; do not introduce hidden defaults in code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/nemo_gym/run_grpo_nemo_gym.py` around lines 151 - 157, The code
hides the default for policy.is_vlm by using config["policy"].get("is_vlm",
False); change the call site to read the value directly (is_vlm =
config["policy"]["is_vlm"]) and keep the downstream logic that selects
get_tokenizer/get_processor (processor, tokenizer) the same, and ensure the
YAML/config (policy.is_vlm) contains an explicit default so the key is always
present rather than relying on a code-side default.
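The point of the guideline can be seen in a small sketch, with plain dicts standing in for the loaded config: direct access surfaces a missing key immediately instead of silently defaulting.

```python
# Config loaded from YAML that declares the default explicitly.
config = {"policy": {"is_vlm": False}}
print(config["policy"]["is_vlm"])  # explicit False, owned by the YAML

# Config whose YAML forgot the key: direct access fails fast.
broken = {"policy": {}}
try:
    broken["policy"]["is_vlm"]
except KeyError:
    print("missing policy.is_vlm raises KeyError instead of hiding behind a code default")
```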
```python
        | SchedulerMilestones
        | None
    ]
    is_vlm: NotRequired[bool]
```
Document the new PolicyConfig.is_vlm key.
This adds a public config knob, but the field still has no inline documentation for its purpose, valid values, or expected default. Please document it here and make sure a canonical example config reflects that default.
As per coding guidelines, "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/models/policy/__init__.py` at line 295, Add inline documentation for
the new TypedDict key PolicyConfig.is_vlm: annotate the line where is_vlm:
NotRequired[bool] with a short comment describing its purpose (what enabling
"vlm" does), the valid type (bool), the recommended default (e.g., False) and
expected behavior when omitted, and keep the NotRequired typing as-is; then
update the canonical example config files under examples/configs/*.yaml to
include the key set to the documented default so examples reflect the intended
default behavior.
terrykong
left a comment
awesome! @yfw @aroshanghias-nvd can you review?
cc @ksapra @DanialTaheri if you have any comments or questions
```python
# NeMo-Gym environment needs to get dp_openai_server_base_urls from policy_generation, so we don't setup env here.
print("\n▶ Setting up data...")
train_dataset, val_dataset = setup_response_data(
```
setup_response_data accepts is_vlm flag. Should we pass it here?
I believe since env_configs is None for NeMo Gym, is_vlm won't be used in setup_response_data, so passing it wouldn't change anything as is. A larger change could incorporate is_vlm here, but maybe in a separate PR?
aroshanghias-nvd
left a comment
Looks good to me. Just one minor comment added.
this is okay here if the goal is just to demo, but we ideally want every config we expect people to run to follow this practice
RL/tests/test_suites/README.md
Lines 50 to 52 in bc8aa39
```python
        | SchedulerMilestones
        | None
    ]
    is_vlm: NotRequired[bool]
```
i understand why this is added, but can we potentially re-use this arg to avoid having too many args doing the same thing
RL/nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
Lines 224 to 225 in bc8aa39
my preference is for that arg, b/c it doesn't shoehorn this one in, and b/c we will do ALM/OMNI in the future
What does this PR do ?
Adds VLM support to the NeMo Gym integration. The training run shown uses the circle click environment in NeMo Gym: https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/circle_click
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit